Data processing with Spark in R & Python

I recently gave a talk on data processing with Apache Spark using R and Python. tl;dr – the slides and presentation can be accessed below:

As noted in my previous post, Spark has become the defacto standard for big data applications and has been adopted quickly by the industry. See Cloudera’s  One Platform initiative blog post by CEO Mike Olson for their commitment to Spark.

In data science R had seen rapid adoption, not only because it was open source and free compared to costly SAS, but also the huge number of statistical and graphical packages provided by R for data science. The most popular ones of course are the ones from Hadley Wickham (dplyr, ggplot2, reshape2, tidyr and more). On the other hand, Python had seen rapid adoption among developers and engineers due to its being useful to script big data tasks along with data analysis with the help of packages like pandas, scikit-learn, NumPy, SciPy, matplotlib etc. and also the popular iPython & later Jupyter notebooks.

There are numerous posts strewn on the net picking fights between R and Python. However it is quite usual for any big data and data science shop to have developers and data scientists who use either or both these tools. Spark makes it easy for both communities to leverage the power of Hadoop and distributed processing systems with its own APIs like DataFrames which can be used in a polyglot fashion. Therefore it is essential for any data enthusiast to learn about how data processing in Spark can be done using R or Python.


What roles do you need in your data science team?

Over the past few weeks, we’ve had several conversations in our data lab regarding data engineering problems and day to day problems we face with unsupervised data scientists who find it difficult to deploy their code into production.

Data scientist

The data scientist

The opinions from business seemed to cluster around a tacit definition of data scientists as researchers, primarily from statistics or mathematics backgrounds, who are experienced in machine learning algorithms and often in some domain areas specific to our business, (e.g. actuaries in insurance), but not necessarily having skills of writing production-ready code.
The key driver behind the somewhat opposing strain of thought came from the developers and data engineers who often quoted Cloudera’s Director of Data Science – Josh Wills – famous for his “definition of a data scientist tweet”:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Josh Wills' definition of data scientist

Josh Wills’ definition of data scientist

Wills’ quote reflects the practical issues in finding “unicorn” data scientists and having to do with the best of what’s on offer for a multi-disciplinary area like data science. It’s also perhaps based on his work in startups like Cloudera and web giants like Google, where adopting agile practices like DevOps allow data scientists closer interaction with engineers and therfore substantial experience in deploying to production. Unfortunately, that’s always a challenge due to bureaucracy, mindset, lack of informed opinion and cultural barriers in larger or old-world organizations with legacy systems and practices.

As in any startup or lab working on problems in data science and big data, it’s important for us to clear misconceptions and get the team to a shared understanding of commonly used terms to establish a foundational common language, which would then allow developing a shared vision around our objectives. Therefore it’s necessary to review going beyond definitions of the “unicorn” data scientists and looking at what happens in real-life teams where data scientists work, like ours.

Different perspectives
A lot of the data scientists actually think of themselves as mathematicians, trying to formulate business problems into math/statistics problems and then trying to solve them in the data science projects.
However, the popular misconception arise sometimes out of the big-data hype articles churned out by big data vendors, including some evangelists – who equate data scientists with superpowers across a multitude of disciplines.
The developer’s views arise due to their unique perspectives on the complexities of data wrangling and fragmentation around tools, technologies and languages.

The reality, as always, is quite different from the hype. There are actually probably just a handful of the “unicorn” data scientists on the planet, who have superpowers in maths/stats,AI/machine learning, a variety of programming languages, an even wider variety of tools and techniques, and of course are great in understanding business problems and articulating complex models and maths in business-speak. For the lesser mortals, and less fortunate businesses, we have to do with multiple individuals to combine these skillsets together into a team or data science squad.

Building data science teams

In terms of hiring, building a data science team becomes much easier, once we get around the idea that the “unicorn” data scientists are not really available. The recruitment team and hiring manager can then focus on the individual skills that are required on the team and try to hire for profiles with strengths in these skills. Once hired, the manager’s role switches to building the team in terms of setting expectations and facilitating collaborative team dynamics to evolve self-governing teams, which can then focus on achieving the objectives in a collaborative manner, instead of having to be superheroes.

Dream data science team?

Dream data science team? Einstein, Fisher, Tufte, Jobs

The roles in a data science team

So what roles would a data science team have? Depending upon the organizations’ objectives, the team could either focus on service-oriented consulting gigs, or focus on building reusable assets or data products.

  • Data scientist – this role would be primarily of someone who can work on large datasets (usually on Hadoop/Spark) with machine learning algorithms, develop predictive models, understand the “theory” – maths and stats behind the models and can interpret and explain model behavior in jargon-free language. Typically this role requires good knowledge of SQL and familiarity with at least one programming language for predictive data analysis e.g. R and/ Python.
Netflix requirements for data scientist role

Netflix requirements for data scientist role

  • Data engineer / Data software developer – this role is for someone who has good understanding of distributed programming, including insfrastructure and architecture. Typically this person is comfortable with installation of distributed programming frameworks like Hadoop MapReduce/Spark clusters, can code in more than one programming languages like Scala/Python/Java, and knows Unix scripting and SQL. Based on range and depth of experience, this role can evolve into one of the two specialized roles – that of the data solutions architect and the data platform administrator.
Netflix requirements for data engineer role

Netflix requirements for data engineer role

  • Data solutions architect – A data engineer with a range of deep experience across several distributed technologies, and who also has good understanding of service-oriented architecture concepts and web applications (SOA concepts and REST frameworks) in addition to the developer skillsets.
  • Data platform administrator – A data engineer who has extensive experience across distributed technologies, especially managing clusters including production envionments and good knowledge of cloud computing architectures (public clouds like AWS if using public cloud or OpenStack and Linux sysadmin experience if using private/hybrid clouds)
  • Full-stack developer – This is an optional role – only required for teams which are focused on building data products with web interfaces. The full-stack developer is ideally an experienced web developer with experience in both backend and front-end e.g. a MEAN developer with experience on MongoDB, Express, AngularJS and NodeJS.
  • Designer – this role demands an expert who has deep knowledge of user experience (UX) and interface design, primarily for web/mobile applications depending on target form factors of the data product as well as data visualization and desirably some UI coding expertise. Building quick mockups and wireframes design is often required during product definition, and the designer needs to be able to work with business as well as developers in a collaborative fashion. Sometimes this role is played by front-end UI developers as good designers don’t come cheap.
Netflix requirements for UX designer

Netflix requirements for UX designer

  • Product manager – This is an optional role – only required (but the key one) for teams focused on building data products. Defining the product vision, translating business problems into user stories, and focusing on getting the development team to build data products based on the user stories, aligning product releases and overall roadmap to business requirements and expectations is a key requirement from this role. Having product management experience along with relevant technical expertise is critical for this role due to differences in life-cycles of products and IT projects, as also the ability to present the voice of the customer and balance long-term vision with short-term needs. Back-filling this role with data scientists/developers who do not have product vision/business acumen is dangerous due to lures of gold-plating and lack of project management skills.
Google product manager role

Google requirements for product manager

  • Project manager role may also be optionally required when the team is low on experience. In most successful cases of performance, managers set the objectives and expectations and facilitate to build self-governing teams following agile practices.

Irrespective of whether the data science teams focus on consulting services in one-off projects or build data products which are reused, in both cases, the team would still require a minimum foundation to build on – in terms of processes or shared understanding, and tools and platforms to perform the actual work. We’ll review the data engineering requirements for such tools and platforms in the next post.

An introduction to Data Science

I presented a talk last week introducing Data Science and associated topics to some enthusiasts.
Here’s a slide deck I created quickly with markdown using Swipe – a start-up building HTML5 presentation tools.
Here are the slides:

Why Spark is the big data platform of the future

Apache Spark has created a lot of buzz recently. In fact, beyond the buzz, Apache Spark has seen phenomenal adoption and has been marked out as the successor to Hadoop MapReduce.

Apache Spark

Apache Spark

Google Trends confirms the hockey stick like growth in interest in Apache Spark.  All leading Hadoop vendors, including Cloudera, now include Apache Spark in their Hadoop distribution.

GoogleTrends - Apache Spark

GoogleTrends – Interest in Apache Spark

So what exactly is Spark, and why has it generated such enthusiasm? Apache Spark is an open-source big data processing framework designed for speed and ease of use.  Spark is well-known for its in-memory performance, but that has also given rise to misconceptions about its on-disk abilities. Spark is in fact a general execution engine – which has a greatly improved performance both in-memory as well as on-disk, when compared with older frameworks like MapReduce. With its advanced DAG (directed acyclic graph) execution engine, Spark can run programs up to 100x faster than MapReduce in memory, or 10x faster on-disk. Why is Spark faster than MapReduce?

  • A key step during MapReduce operations is the synchronization or  “shuffle” step, intermediate between the “map”-step and the “reduce”-step. Apache Spark implements a sort-based shuffle design, which improves performance.
  • Apache Spark also includes a DAG (directed-acyclic graph) which allow developers to execute DAGs all at once, not step by step. This eliminates the costly synchronization required by MapReduce. Note that DAGs are also used by Storm and Tez
  • Spark supports in-memory data sharing across DAGs, so different jobs can work with the same data at a very high speed.

It’s important to remember that Hadoop is a decade-old technology, developed at a time when memory was still relatively expensive, and therefore took the design approach of persistence to disk as a way of maintaining state between execution steps. On the other hand, Spark was developed at UC Berkeley AMPLab in 2009 and then it was open-sourced in 2010 – when memory had become much cheaper. Therefore, Spark stores data in memory and transparently persists it to disk if needed, thereby achieving better performance. The core concept of Spark is this programming abstraction over data storage – called RDDs (Resilient Distributed Dataset). Under the hood, Spark automatically distributes the data contained in RDDs across the cluster and parallelizes the operations performed on them.

Word-count code in Spark

Word-count code in Spark

The end result is that, on an average – the lines of code required to develop a distributed data processing program is much less in Spark, when compared with MapReduce. See more details on why Spark is the fastest open-source engine for sorting a petabyte. Clearly, faster execution has been one of the key reasons for the uptake of Spark, but Spark also provides further advantages. Similar to YARN, the upgrade of the Hadoop framework over the MapReduce-only version, Spark allows a wide range of workloads from batch to interactive and streaming. It reduces the burden of maintaining separate tools as in Hadoop – and provides APIs in Scala, Java, Python and SQL.  Spark can run over a variety of cluster-managers, including Hadoop YARN, Apache Mesos, and Spark’s own standalone scheduler. Spark components



Spark Core – provides basic functionality of Apache Spark, including RDDs and APIs to manipulate them. Spark SQL – A new component which replaces the older Shark (SQL on Spark) project, this package provides better integration with Spark Core, it allows querying data through SQL and HiveQL and supports many data sources from Hive tables, Parquet and JSON. Spark SQL also allows developers to intermix SQL queries with the code for data manipulations with RDDs in Python, Java, and Scala. It also provides fast SQL connectivity to BI tools like Tableau or QlikView.



Spark Streaming – based on micro-batching, this component enables processing of real-time streaming data. It uses DStreams, which are series of RDDs, to process real-time data. The Spark Streaming API is very similar to the Spark Core RDD APIs, making it easy for developers to reuse and adapt code for batch to interactive or real-time applications. MLlib – provides a library of machine learning algorithms including classification, regression, clustering, and collaborative filtering, as well as model evaluation and data import. GraphX – provides an API for graphs and graph-parallel computations and operators for manipulating graphs and a library of graph algorithms. The SparkR project aims to provide a light-weight front-end to use Apache Spark from R. Work is on to integrate SparkR into Spark. Recently, Spark has introduced a dataframe library with R/Pandas syntax for use across all of the Spark language APIs and an ML pipeline API which also integrates with data frames. Spark adoption is increasing manifold, boosted by increased third-part vendor support. Databricks – the company spun out of AMPLab by the creators of Apache Spark, now provides Spark as a service on the cloud – with its own Databricks Cloud – which is in private beta. The Databricks cloud is designed to support data science in the lab as well as in the factory – by creating polyglot notebooks (mix of Scala/Java/Python/SQL possible) and building production pipelines for ETL and analytics jobs. Tableau and MemSQL have provided Spark connectors, Altiscale now provides Spark in the cloud and machine learning vendors like Nube are building products like Reifier to perform entity resolution and de-duplication using Spark. ClearStory Data provides Spark-based data processing and analytics. There is also a fledgling community of packages for Apache Spark. Big data and data science projects are complex with an increasing diverse toolset which require massive integration efforts. Greater flexibility than that provided by MapReduce, capability to support a variety of workloads and a simpler, more unified ecosystem of tools which work out of the box on a general execution engine (Apache Spark) thus provide better simplicity than the complex zoo of Hadoop MapReduce projects. Together with SparkSQL and dataframes library, Spark democratizes access to distributed data processing beyond MapReduce programmers extending it to other developers and business analysts. Over and above, considering the fast performance of Spark, it is no wonder that Apache Spark continues to gain traction and looks all set to be the default framework for Big data processing in the near future. More info:

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL

A gentle introduction to Machine Learning

Machine Learning is a big part of big data and data science. A subset of artificial intelligence – a branch of science notorious for requiring advanced knowledge of mathematics. In practice though, most data scientists don’t try to build a Chappie  and there are simpler, practical ways to get started with machine learning.

Gmail Priority Inbox

Gmail Priority Inbox

Machine learning in practice involves predictions based on data. Notable examples include Amazon’s product recommendations with the “customers also bought” scroll-list, or Gmail’s priority inbox or any email spam-filter feature. How do these work? For Amazon, clicks by the user is used to learn and predict user behavior and propensity (likelihood) to buy certain items. The items the user is most likely to buy are then displayed on the recommendation system. Gmail’s system learns from the messages which the user reads and/replies to and prioritizes them.

Amazon Recommendations

Amazon Recommendations

In both cases, there are some predictions made based on certain example usage of the data. Thus in essence, machine learning is about predictions based on models, which themselves are created based on examples.

More specifically, a machine learning model is a set of explanations of the relationships between the input data and the output predictions. These relationships are discovered from examples of input-output pairs. In machine learning terminology, the input data is also called features and the predictions are called output. Once a model has been created, it can be used on new inputs to predict outputs.

Machine learning models therefore “learn” to predict from examples. This learning is also known as “training” the model, and the associated good quality data-set is called “training data-set“. The stage where the model is used on new inputs is known as “testing” the model, and the data-set associated with it is called “test data-set“.
There are different ways to perform this learning, with different types of algorithms to build models and perform predictions. Most common among these are Classification and Regression techniques. The Gmail spam-filter is an example of Classification technique. Given a set of emails marked as spam or not-spam, it learns the characteristics of emails and is then able to process future email messages to mark them as spam or not. Classification deals with prediction of which class ordinal data fits in, while regression deals with prediction of continuous numeric data. Example of regression is a best-fit line drawn through some data points for generalization. Both classification and regression are examples of supervised learning, as the algorithm is told to predict a label or target value.

The opposite of this is unsupervised learning – where there’s no label or target value given for the data. An example of this could be clustering – a task of grouping a items so that objects in the same group or cluster are more similar (in some sense or another) to each other than to those in other groups (clusters).

Machine Learning Techniques

Machine Learning Techniques

With so many choices, how do you choose the right algorithm? Without considering nuances of the data, a rule of thumb is to look at the objective of the prediction:

  • If the prediction is to forecast a target value, we use supervised learning, else use unsupervised learning or density estimation algorithms.

It is important to note though, that this is not unbreakable, rather usage of algorithms is rather fuzzy. This is quite common in machine learning, where most problems are not deterministic in nature, and often a bunch of different algorithms are tried out to see how they perform. There are also ensemble models like Gradient Boosting – a regression technique which  uses an ensemble of weak prediction models, typically decision trees to get an improved prediction model. An interesting tool based on symbolic regression, which infers the model from the data is Nutonian Eureqa, also dubbed as the robotic data scientist.

Many algorithms are different, but the steps to use one are similar:

Collect data > Prepare data > Analyze input data > Clean data/verify data quality > Train the algorithm > Test the algorithm > Iterate/Deploy. (See also my earlier post on the data science project lifecycle)

As with many other aspects (data wrangling) of data science, both R and Python are very popular languages for using machine learning techniques. There are also start-ups like BigML providing MLaaS or Machine Learning as a Service.

In conclusion of this post, a few points to remember: garbage in – garbage out:- data quality matters as much if not more than algorithms, quantity of data or complexity of algorithm are not substitutes for quality, and of course as with all predictions, machine learning can be wrong as well.