Category Archives: Data science

Data processing with Spark in R & Python

I recently gave a talk on data processing with Apache Spark using R and Python. tl;dr – the slides and presentation can be accessed below:

As noted in my previous post, Spark has become the defacto standard for big data applications and has been adopted quickly by the industry. See Cloudera’s  One Platform initiative blog post by CEO Mike Olson for their commitment to Spark.

In data science R had seen rapid adoption, not only because it was open source and free compared to costly SAS, but also the huge number of statistical and graphical packages provided by R for data science. The most popular ones of course are the ones from Hadley Wickham (dplyr, ggplot2, reshape2, tidyr and more). On the other hand, Python had seen rapid adoption among developers and engineers due to its being useful to script big data tasks along with data analysis with the help of packages like pandas, scikit-learn, NumPy, SciPy, matplotlib etc. and also the popular iPython & later Jupyter notebooks.

There are numerous posts strewn on the net picking fights between R and Python. However it is quite usual for any big data and data science shop to have developers and data scientists who use either or both these tools. Spark makes it easy for both communities to leverage the power of Hadoop and distributed processing systems with its own APIs like DataFrames which can be used in a polyglot fashion. Therefore it is essential for any data enthusiast to learn about how data processing in Spark can be done using R or Python.


What roles do you need in your data science team?

Over the past few weeks, we’ve had several conversations in our data lab regarding data engineering problems and day to day problems we face with unsupervised data scientists who find it difficult to deploy their code into production.

Data scientist

The data scientist

The opinions from business seemed to cluster around a tacit definition of data scientists as researchers, primarily from statistics or mathematics backgrounds, who are experienced in machine learning algorithms and often in some domain areas specific to our business, (e.g. actuaries in insurance), but not necessarily having skills of writing production-ready code.
The key driver behind the somewhat opposing strain of thought came from the developers and data engineers who often quoted Cloudera’s Director of Data Science – Josh Wills – famous for his “definition of a data scientist tweet”:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Josh Wills' definition of data scientist

Josh Wills’ definition of data scientist

Wills’ quote reflects the practical issues in finding “unicorn” data scientists and having to do with the best of what’s on offer for a multi-disciplinary area like data science. It’s also perhaps based on his work in startups like Cloudera and web giants like Google, where adopting agile practices like DevOps allow data scientists closer interaction with engineers and therfore substantial experience in deploying to production. Unfortunately, that’s always a challenge due to bureaucracy, mindset, lack of informed opinion and cultural barriers in larger or old-world organizations with legacy systems and practices.

As in any startup or lab working on problems in data science and big data, it’s important for us to clear misconceptions and get the team to a shared understanding of commonly used terms to establish a foundational common language, which would then allow developing a shared vision around our objectives. Therefore it’s necessary to review going beyond definitions of the “unicorn” data scientists and looking at what happens in real-life teams where data scientists work, like ours.

Different perspectives
A lot of the data scientists actually think of themselves as mathematicians, trying to formulate business problems into math/statistics problems and then trying to solve them in the data science projects.
However, the popular misconception arise sometimes out of the big-data hype articles churned out by big data vendors, including some evangelists – who equate data scientists with superpowers across a multitude of disciplines.
The developer’s views arise due to their unique perspectives on the complexities of data wrangling and fragmentation around tools, technologies and languages.

The reality, as always, is quite different from the hype. There are actually probably just a handful of the “unicorn” data scientists on the planet, who have superpowers in maths/stats,AI/machine learning, a variety of programming languages, an even wider variety of tools and techniques, and of course are great in understanding business problems and articulating complex models and maths in business-speak. For the lesser mortals, and less fortunate businesses, we have to do with multiple individuals to combine these skillsets together into a team or data science squad.

Building data science teams

In terms of hiring, building a data science team becomes much easier, once we get around the idea that the “unicorn” data scientists are not really available. The recruitment team and hiring manager can then focus on the individual skills that are required on the team and try to hire for profiles with strengths in these skills. Once hired, the manager’s role switches to building the team in terms of setting expectations and facilitating collaborative team dynamics to evolve self-governing teams, which can then focus on achieving the objectives in a collaborative manner, instead of having to be superheroes.

Dream data science team?

Dream data science team? Einstein, Fisher, Tufte, Jobs

The roles in a data science team

So what roles would a data science team have? Depending upon the organizations’ objectives, the team could either focus on service-oriented consulting gigs, or focus on building reusable assets or data products.

  • Data scientist – this role would be primarily of someone who can work on large datasets (usually on Hadoop/Spark) with machine learning algorithms, develop predictive models, understand the “theory” – maths and stats behind the models and can interpret and explain model behavior in jargon-free language. Typically this role requires good knowledge of SQL and familiarity with at least one programming language for predictive data analysis e.g. R and/ Python.
Netflix requirements for data scientist role

Netflix requirements for data scientist role

  • Data engineer / Data software developer – this role is for someone who has good understanding of distributed programming, including insfrastructure and architecture. Typically this person is comfortable with installation of distributed programming frameworks like Hadoop MapReduce/Spark clusters, can code in more than one programming languages like Scala/Python/Java, and knows Unix scripting and SQL. Based on range and depth of experience, this role can evolve into one of the two specialized roles – that of the data solutions architect and the data platform administrator.
Netflix requirements for data engineer role

Netflix requirements for data engineer role

  • Data solutions architect – A data engineer with a range of deep experience across several distributed technologies, and who also has good understanding of service-oriented architecture concepts and web applications (SOA concepts and REST frameworks) in addition to the developer skillsets.
  • Data platform administrator – A data engineer who has extensive experience across distributed technologies, especially managing clusters including production envionments and good knowledge of cloud computing architectures (public clouds like AWS if using public cloud or OpenStack and Linux sysadmin experience if using private/hybrid clouds)
  • Full-stack developer – This is an optional role – only required for teams which are focused on building data products with web interfaces. The full-stack developer is ideally an experienced web developer with experience in both backend and front-end e.g. a MEAN developer with experience on MongoDB, Express, AngularJS and NodeJS.
  • Designer – this role demands an expert who has deep knowledge of user experience (UX) and interface design, primarily for web/mobile applications depending on target form factors of the data product as well as data visualization and desirably some UI coding expertise. Building quick mockups and wireframes design is often required during product definition, and the designer needs to be able to work with business as well as developers in a collaborative fashion. Sometimes this role is played by front-end UI developers as good designers don’t come cheap.
Netflix requirements for UX designer

Netflix requirements for UX designer

  • Product manager – This is an optional role – only required (but the key one) for teams focused on building data products. Defining the product vision, translating business problems into user stories, and focusing on getting the development team to build data products based on the user stories, aligning product releases and overall roadmap to business requirements and expectations is a key requirement from this role. Having product management experience along with relevant technical expertise is critical for this role due to differences in life-cycles of products and IT projects, as also the ability to present the voice of the customer and balance long-term vision with short-term needs. Back-filling this role with data scientists/developers who do not have product vision/business acumen is dangerous due to lures of gold-plating and lack of project management skills.
Google product manager role

Google requirements for product manager

  • Project manager role may also be optionally required when the team is low on experience. In most successful cases of performance, managers set the objectives and expectations and facilitate to build self-governing teams following agile practices.

Irrespective of whether the data science teams focus on consulting services in one-off projects or build data products which are reused, in both cases, the team would still require a minimum foundation to build on – in terms of processes or shared understanding, and tools and platforms to perform the actual work. We’ll review the data engineering requirements for such tools and platforms in the next post.

An introduction to Data Science

I presented a talk last week introducing Data Science and associated topics to some enthusiasts.
Here’s a slide deck I created quickly with markdown using Swipe – a start-up building HTML5 presentation tools.
Here are the slides:

A gentle introduction to Machine Learning

Machine Learning is a big part of big data and data science. A subset of artificial intelligence – a branch of science notorious for requiring advanced knowledge of mathematics. In practice though, most data scientists don’t try to build a Chappie  and there are simpler, practical ways to get started with machine learning.

Gmail Priority Inbox

Gmail Priority Inbox

Machine learning in practice involves predictions based on data. Notable examples include Amazon’s product recommendations with the “customers also bought” scroll-list, or Gmail’s priority inbox or any email spam-filter feature. How do these work? For Amazon, clicks by the user is used to learn and predict user behavior and propensity (likelihood) to buy certain items. The items the user is most likely to buy are then displayed on the recommendation system. Gmail’s system learns from the messages which the user reads and/replies to and prioritizes them.

Amazon Recommendations

Amazon Recommendations

In both cases, there are some predictions made based on certain example usage of the data. Thus in essence, machine learning is about predictions based on models, which themselves are created based on examples.

More specifically, a machine learning model is a set of explanations of the relationships between the input data and the output predictions. These relationships are discovered from examples of input-output pairs. In machine learning terminology, the input data is also called features and the predictions are called output. Once a model has been created, it can be used on new inputs to predict outputs.

Machine learning models therefore “learn” to predict from examples. This learning is also known as “training” the model, and the associated good quality data-set is called “training data-set“. The stage where the model is used on new inputs is known as “testing” the model, and the data-set associated with it is called “test data-set“.
There are different ways to perform this learning, with different types of algorithms to build models and perform predictions. Most common among these are Classification and Regression techniques. The Gmail spam-filter is an example of Classification technique. Given a set of emails marked as spam or not-spam, it learns the characteristics of emails and is then able to process future email messages to mark them as spam or not. Classification deals with prediction of which class ordinal data fits in, while regression deals with prediction of continuous numeric data. Example of regression is a best-fit line drawn through some data points for generalization. Both classification and regression are examples of supervised learning, as the algorithm is told to predict a label or target value.

The opposite of this is unsupervised learning – where there’s no label or target value given for the data. An example of this could be clustering – a task of grouping a items so that objects in the same group or cluster are more similar (in some sense or another) to each other than to those in other groups (clusters).

Machine Learning Techniques

Machine Learning Techniques

With so many choices, how do you choose the right algorithm? Without considering nuances of the data, a rule of thumb is to look at the objective of the prediction:

  • If the prediction is to forecast a target value, we use supervised learning, else use unsupervised learning or density estimation algorithms.

It is important to note though, that this is not unbreakable, rather usage of algorithms is rather fuzzy. This is quite common in machine learning, where most problems are not deterministic in nature, and often a bunch of different algorithms are tried out to see how they perform. There are also ensemble models like Gradient Boosting – a regression technique which  uses an ensemble of weak prediction models, typically decision trees to get an improved prediction model. An interesting tool based on symbolic regression, which infers the model from the data is Nutonian Eureqa, also dubbed as the robotic data scientist.

Many algorithms are different, but the steps to use one are similar:

Collect data > Prepare data > Analyze input data > Clean data/verify data quality > Train the algorithm > Test the algorithm > Iterate/Deploy. (See also my earlier post on the data science project lifecycle)

As with many other aspects (data wrangling) of data science, both R and Python are very popular languages for using machine learning techniques. There are also start-ups like BigML providing MLaaS or Machine Learning as a Service.

In conclusion of this post, a few points to remember: garbage in – garbage out:- data quality matters as much if not more than algorithms, quantity of data or complexity of algorithm are not substitutes for quality, and of course as with all predictions, machine learning can be wrong as well.

Designing the future – Data Innovation Labs

With the ongoing Big data revolution, and the impending Internet of Things revolution, there has been a renewed enthusiasm in “innovation” around data.  Similar to the Labs concept started by Google (think Gmail Beta based on Ajax, circa 2004), more and more organizations, business communities, governments and countries are setting up Labs to foster innovation in data and analytics technologies. The idea behind these “data innovation labs” is to develop avant-garde data and analytics technologies and products in an agile fashion and move quickly from concept to production. Given the traditional bureaucratic setup in large organizations and governments, these Labs stand a better chance of fostering a culture of innovation, due to their being autonomous entities and their startup-mode culture leveraging agile methodologies.

Data Innovation Labs

Data Innovation Labs

Below I list a few of the data innovation labs that have been setup to get value for their parent entities in the data and analytics space, trying to build data products in the Big Data and Data Science fields.

  • Data Innovation Lab – Thomson Reuters
    • Small group of about ten people, partnering with internal teams, third-parties and customers to find data-driven innovations
    • experiments with mash-ups of internal and external data in novel ways
    • hosts internal crowd-sourcing competitions
    • translate business problem into technical data problem statement
    • created Exchange – a digital forum for sharing ideas and insights
    • Partners with Central Strategy to estimate potential and market-size for new data innovation opportunities
  • The Data Lab – Scotland
    • mission is to strengthen Scotland’s local industry and transfer world-leading research in informatics and computer science in the global marketplace
    • focuses on skills and training by working with industry  to create  a pipeline of talented data scientists equipped with the relevant skills
    • connects world-leading researchers and data scientists with local industry and public sector organizations, giving them access to experts who can help collaborate on solutions to key problems
  • Smart Data Innovation Lab – German government
    • Hosted at Karlsruhe Institute for Technology, its mission is to turn big data into smart data
    • plans to store data centrally in a highly secured environment for research purposes
    • has cutting-edge insfrastructure for processing Big Data including software like SAP HANA, IBM Watson and hardware on IBM Power and Intel architectures
    • Industry partners to deliver data sources directly from the practice environment, to be complemented with crowdsourced data and open data
    • plans to offer an open source repository for reuse in research
  • Midata Innovation Lab – UK government
    • An organization run by the Department of  Business Innovation & Skills with involvement of industry
    • Accelerator for businesses to create new services for consumers, from data
    • work involves concept of personal data stores (PDS) or personal clouds
    • working with three leading PDS – Allfiled, Mydex and Paoga
    • Participating organizations and developers use PDS to create new innovative services for consumers
  • Nordstrom Innovation Lab – Nordstrom
    • Internal technology lab focused on innvoation around technology
    • Secondary focus – but still in scope: operations, products, business models and management
    • Goal is to deliver data-driven products to inform business decisions internally, and to enhance customer experience externally
    • Multi-disciplinary team of techies, designers, entrepreneurs, statisticians, researchers and artists
  • GFDRR Innovation Lab – World Bank
    • Global facility for disaster reduction and recovery, a global partnership, managed by the World Bank and funded by 25 donor partners
    • supports use of science, technology, open data and innovation to empower decision-makers to increase their resilience
    • tries to apply the concepts of the global open data movement to the challenges of reducing vulnerability to natural hazards and  the impacts of climate change through OpenDRI (Open Data for Resilience Initiative)
  • Big Data Innovation Center and Innovation Lab – SAP
    • Focus on SAP’s mobile and cloud portfolio
    • Mission is to extend SAP stack and develop innovative data-driven process applications leveraging an integrated platform and next-generation DB technologies
    • Partnership and exchanges with leading schools including Stanford, MIT, Berlin universities
    • Short, fast-paced innovation cycles
    • Project run-times of a few months on an average
    • Hands over prototypes to SAP development for turning into market-ready products

Set up a Hadoop Spark cluster in 10 minutes with Vagrant

With each of the big 3 Hadoop vendors – Cloudera, Hortonworks and MapR each providing their own Hadoop sandbox virtual machines (VMs), trying out Hadoop today has become extremely easy. For a developer, it is extremely useful to download a get started with one of these VMs and try out Hadoop to practice data science right away.

Vagrant Hadoop Spark Cluster

Set up a Hadoop-Spark cluster with Vagrant in 10 minutes

However, with the core Apache Hadoop, these vendors package their own software into their distributions, mostly for the orchestration and management, which can be a pain due to the multiple scattered open-source projects within the Hadoop ecosystem. e.g. Hortonworks includes the open-source Ambari while Cloudera includes its own Cloudera Manager for orchestrating Hadoop installations and managing multi-node clusters.

Moreover, most of these distributions require today a 64-bit machine and sometimes a high-amount of memory (for a laptop). e.g. running Cloudera Manager with a full-blown Cloudera Hadoop Distribution (CDH) 5.x requires at least 10GB RAM. For a developer with a laptop, RAM is always at a premium, hence it may seem easier to try out the vanilla Apache Hadoop downloads for installations. The documentation for Hadoop for installing a single-node cluster, and even a multi-node cluster is much improved nowadays, but with the hassles of downloading the distributions and setting up SSH, it can easily take up a long-time to effectively set up a useful multi-node cluster. The overhead of setting up and running multiple VMs can also be a challenge. The vanilla distributions also require separate installations for UI (Cloudera Hue being a nice one) and job tracking (Oozie) or orchestration (Ambari). Unfortunately Ambari works only with select versions of HDP (Hortonwork’s distribution of Hadoop), and configuring Oozie with disparate versions of Hadoop, Java and other libraries can be a real pain.

One of the solutions to this problem is to use a container-based approach to installation. Hadoop clusters can be setup with LXC (Linux containers) approach, e.g. with the very popular Docker. There are also other approaches with using Puppet, Ansible, Chef and Salt which allow easy installations. One of the simpler approaches that I tried apart from vanilla Hadoop is using Vagrant. Indeed setting up VMs with Vagrant is a breeze, and with a vagrant script (written in Ruby), setting up a multi-node cluster is very quick. In fact you can get started with a Hadoop and Spark multi-node cluster in less than 10 minutes.

Check out the project on Github – it’s adapted from Jee Vang’s excellent Vagrant project to allow for 32-bit machines,  speed-up with pre-downloads of Hadoop, Spark and Java, and includes an updated Readme with script change locations detailed.

The data science project lifecycle

How does the typical data science project life-cycle look like?

This post looks at practical aspects of implementing data science projects. It also assumes a certain level of maturity in big data (more on big data maturity models in the next post) and data science management within the organization. Therefore the life cycle presented here differs, sometimes significantly from purist definitions of ‘science’ which emphasize the hypothesis-testing approach. In practice, the typical data science project life-cycle resembles more of an engineering view imposed due to constraints of resources (budget, data and skills availability) and time-to-market considerations.

The CRISP-DM model (CRoss Industry Standard Process for Data Mining) has traditionally defined six steps in the data mining life-cycle. Data science is similar to data mining in several aspects, hence there’s some similarity with these steps.

CRISP-DM lifecycle

CRISP-DM lifecycle

The CRISP model steps are:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation and
6. Deployment

Given a certain level of maturity in big data and data science expertise within the organization, it is reasonable to assume availability of a library of assets related to data science implementations. Key among these are:
1. Library of business use-cases for big data/ data science applications
2. Data requirements – business use case mapping matrix
3. Minimum data quality requirements (test cases to ensure minimum level of data quality to ensure feasibility)

In most organizations, data science is a fledgling discipline, hence data scientists (except those from actuarial background) are likely to have limited business domain expertise – therefore they need to be paired with business people and those with expertise in understanding the data. This helps data scientists gain or work together on steps 1 and 2 of the CRISM-DM model – i.e. business understanding and data understanding.

The typical data science project then becomes an engineering exercise in terms of a defined framework of steps or phases and exit criteria, which allow making informed decisions on whether to continue projects based on pre-defined criteria, to optimize resource utilization and maximize benefits from the data science project. This also prevents the project from degrading into money-pits due to pursuing nonviable hypotheses and ideas.

The data science life-cycle thus looks somewhat like:
1. Data acquisition
2. Data preparation
3. Hypothesis and modeling
4. Evaluation and Interpretation
5. Deployment
6. Operations
7. Optimization

Data Science Project Life-cycle

Data Science Project Life-cycle

Data Acquisition – may involve acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction and transfer routines would be in place, and new sources, once identified would be acquired following the established processes.
Data preparation – Usually referred to as “data wrangling”, this step involves cleaning the data and reshaping it into a readily usable form for performing data science. This is similar to the traditional ETL steps in data warehousing in certain aspects, but involves more exploratory analysis and is primarily aimed at extracting features in usable formats.
Hypothesis and modeling are the traditional data mining steps – however in a data science project, these are not limited to statistical samples. Indeed the idea is to apply machine learning techniques to all data. A key sub-step is performed here for model selection. This involves the separation of a training set for training the candidate machine-learning models, and validation sets and test sets for comparing model performances and selecting the best performing model, gauging model accuracy and preventing over-fitting.

Steps 2 through 4 are repeated a number of times as needed; as the understanding of data and business becomes clearer and results from initial models and hypotheses are evaluated, further tweaks are performed. These may sometimes include Step 5 (deployment) and be performed in a pre-production or “limited” / “pilot” environment before the actual full-scale “production” deployment, or could include fast-tweaks after deployment, based on the continuous deployment model.

Once the model has been deployed in production, it is time for regular maintenance and operations. This operations phase could also follow a target DevOps model which gels well with the continuous deployment model, given the rapid time-to-market requirements in big data projects. Ideally, the deployment includes performance tests to measure model performance, and can trigger alerts when the model performance degrades beyond a certain acceptable threshold.

The optimization phase is the final step in the data science project life-cycle. This could be triggered by failing performance, or due to the need to add new data sources and retraining the model, or even to deploy improved versions of the model based on better algorithms.

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. As mentioned before, with increasing maturity and well-defined project goals, pre-defined performance criteria can help evaluate feasibility of the data science project early enough in the life-cycle. This early comparison helps the data science team to change approaches, refine hypothesis and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.