
What roles do you need in your data science team?

Over the past few weeks, we’ve had several conversations in our data lab about data engineering problems and the day-to-day difficulties we face with data scientists who, working without engineering support, find it difficult to deploy their code into production.

The data scientist

The opinions from business clustered around a tacit definition of data scientists as researchers, primarily from statistics or mathematics backgrounds, who are experienced in machine learning algorithms and often in domain areas specific to our business (e.g. actuaries in insurance), but who do not necessarily have the skills to write production-ready code.
The somewhat opposing strain of thought came from the developers and data engineers, who often quoted Cloudera’s Director of Data Science, Josh Wills, famous for his tweeted definition of a data scientist:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Josh Wills’ definition of data scientist

Wills’ quote reflects the practical difficulty of finding “unicorn” data scientists and of making do with the best of what’s on offer in a multi-disciplinary area like data science. It is perhaps also based on his work at startups like Cloudera and web giants like Google, where agile practices like DevOps give data scientists closer interaction with engineers and therefore substantial experience in deploying to production. That remains a challenge in larger or old-world organizations with legacy systems and practices, due to bureaucracy, mindset, lack of informed opinion and cultural barriers.

As in any startup or lab working on data science and big data problems, it’s important for us to clear up misconceptions and bring the team to a shared understanding of commonly used terms, establishing a common language from which a shared vision around our objectives can develop. It’s therefore worth going beyond definitions of the “unicorn” data scientist and looking at what happens in real-life teams where data scientists work, like ours.

Different perspectives
Many data scientists actually think of themselves as mathematicians, formulating business problems as math/statistics problems and then solving them in data science projects.
The popular misconception, however, sometimes arises from the big-data hype articles churned out by big data vendors, including some evangelists, who credit data scientists with superpowers across a multitude of disciplines.
The developers’ views arise from their unique perspective on the complexities of data wrangling and the fragmentation of tools, technologies and languages.

The reality, as always, is quite different from the hype. There are probably just a handful of “unicorn” data scientists on the planet who have superpowers in maths/stats, AI/machine learning, a variety of programming languages, an even wider variety of tools and techniques, and who are also great at understanding business problems and articulating complex models and maths in business-speak. The rest of us, and less fortunate businesses, have to combine these skillsets across multiple individuals into a team or data science squad.

Building data science teams

In terms of hiring, building a data science team becomes much easier once we accept that “unicorn” data scientists are not really available. The recruitment team and hiring manager can then focus on the individual skills required on the team and hire for profiles with strengths in those skills. Once the team is hired, the manager’s role switches to setting expectations and facilitating collaborative team dynamics, evolving self-governing teams that can focus on achieving objectives collaboratively instead of having to be superheroes.

Dream data science team? Einstein, Fisher, Tufte, Jobs

The roles in a data science team

So what roles would a data science team have? Depending upon the organization’s objectives, the team could either focus on service-oriented consulting gigs or on building reusable assets and data products.

  • Data scientist – primarily someone who can work on large datasets (usually on Hadoop/Spark) with machine learning algorithms, develop predictive models, understand the “theory” – the maths and stats behind the models – and interpret and explain model behavior in jargon-free language. This role typically requires good knowledge of SQL and familiarity with at least one programming language for predictive data analysis, e.g. R and/or Python.
Netflix requirements for data scientist role

  • Data engineer / data software developer – someone with a good understanding of distributed programming, including infrastructure and architecture. This person is typically comfortable installing distributed programming frameworks like Hadoop MapReduce/Spark clusters, can code in more than one programming language like Scala/Python/Java, and knows Unix scripting and SQL. Depending on the range and depth of experience, this role can evolve into one of two specialized roles: the data solutions architect and the data platform administrator.
Netflix requirements for data engineer role

  • Data solutions architect – a data engineer with deep experience across several distributed technologies, who also has a good understanding of service-oriented architecture and web applications (SOA concepts and REST frameworks) in addition to the developer skillset.
  • Data platform administrator – a data engineer with extensive experience across distributed technologies, especially managing clusters in production environments, and with good knowledge of cloud computing architectures (public clouds like AWS, or OpenStack plus Linux sysadmin experience for private/hybrid clouds).
  • Full-stack developer – an optional role, required only for teams building data products with web interfaces. The full-stack developer is ideally an experienced web developer with both back-end and front-end experience, e.g. a MEAN developer working with MongoDB, Express, AngularJS and NodeJS.
  • Designer – an expert with deep knowledge of user experience (UX) and interface design, primarily for web/mobile applications depending on the target form factors of the data product, plus data visualization and desirably some UI coding expertise. Building quick mockups and wireframes is often required during product definition, and the designer needs to work collaboratively with business as well as developers. Sometimes this role is played by front-end UI developers, as good designers don’t come cheap.
Netflix requirements for UX designer

  • Product manager – an optional role, but the key one for teams focused on building data products. This role defines the product vision, translates business problems into user stories, and focuses the development team on building data products from those user stories, aligning product releases and the overall roadmap with business requirements and expectations. Product management experience along with relevant technical expertise is critical here, given the differences between product and IT-project life-cycles, as is the ability to present the voice of the customer and balance long-term vision against short-term needs. Back-filling this role with data scientists or developers who lack product vision and business acumen is dangerous, due to the lure of gold-plating and a lack of project management skills.
Google requirements for product manager

  • Project manager – an optional role, needed mainly when the team is low on experience. In the most successful teams, managers set the objectives and expectations and facilitate building self-governing teams that follow agile practices.
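
To make the data scientist role above concrete, here is a minimal sketch of the kind of work it describes: pulling data with SQL and fitting an interpretable predictive model. The table, figures and insurance framing are entirely hypothetical, and a real project would use R/Python modeling libraries on far larger datasets:

```python
import sqlite3

# Hypothetical example: pull features with SQL, then fit a simple
# one-variable linear model by ordinary least squares.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policies (age INTEGER, claim_amount REAL)")
conn.executemany("INSERT INTO policies VALUES (?, ?)",
                 [(25, 500.0), (35, 800.0), (45, 1100.0), (55, 1400.0)])

rows = conn.execute("SELECT age, claim_amount FROM policies").fetchall()
xs = [r[0] for r in rows]
ys = [r[1] for r in rows]

# Least squares by hand: slope = cov(x, y) / var(x)
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# The jargon-free interpretation: on this toy data, each extra year
# of age adds about 30 currency units to the expected claim.
print(round(slope, 2), round(intercept, 2))  # prints: 30.0 -250.0
```

The point is less the model than the explanation: the data scientist must be able to state what the slope means in business terms.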

Whether a data science team focuses on consulting services in one-off projects or builds reusable data products, it still requires a minimum foundation to build on: processes and shared understanding, plus the tools and platforms to perform the actual work. We’ll review the data engineering requirements for such tools and platforms in the next post.


Basics of Big Data – Part 1

You can’t miss all the buzz about Big Data! Over the past few years, the buzz around the cloud and Big Data shaping the future of computing, and of IT and analytics in particular, has grown incessantly. As with most buzzwords hijacked by marketing to suit product storylines, Big Data means several things to several people, and manages to confuse business users and IT staff alike.

So what exactly is Big Data?

Big Data refers to enormously large datasets that are challenging to store, search, share and analyze with conventional data storage and processing systems. While these challenges remain, our ability to generate such large datasets has grown exponentially over the past few years. With the march of the digital age and the growing popularity of social networking, the amount of data generated today is growing enormously – not only in public domains like Google, Facebook, YouTube and Twitter, but also within organizations, where the data generated by ever more powerful servers, software and applications far exceeds our capacity to effectively analyze and make sense of it.

The table below shows the growth of data and the new terminology that has been coined to address the growth of data volumes.

| Amount of data | Unit | Typical storage | Real-world analogy |
|---|---|---|---|
| 10³ bytes | Kilobytes (kB) | 1.44 MB high-density floppy disk | Files, folders |
| 10⁶ bytes | Megabytes (MB) | Disks, tape | Folders, cabinets |
| 10⁹–10¹² bytes | Gigabytes (GB) – Terabytes (TB) | Disk arrays | – |
| 10¹⁵ bytes | Petabytes (PB) | – | – |
| 10¹⁸ bytes | Exabytes (EB) | – | – |
| 10²¹ bytes | Zettabytes (ZB) | – | – |
| 10²⁴ bytes | Yottabytes (YB) | – | – |

Volume, Velocity, Variety & Value

The 3Vs of Big Data – volume, velocity and variety – were popularized by Gartner’s analysis. Gartner defines Big Data as “high volume, velocity, and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” What this essentially means is that Big Data, beyond its high volume, moves too fast and is not always structured according to conventional database architectures. For example, multimedia content uploaded to YouTube or comments on Twitter or Facebook, coupled with the velocity at which they are generated and churned, is clearly not in a structured format suited to conventional data processing. To analyze this data and gain insights from it – to derive “value” – requires a wholly different approach.

Old wine in new bottle? Data mining and Big Data Analytics

A modeling approach is required to derive value out of Big Data: a hypothesis is proposed, statistical models are created, and the models are validated or updated using data. Interestingly, this approach bears substantial resemblance, even overlap, with “data mining”. For those unfamiliar with the term – the more obscure and geeky part of business intelligence and information management – “data mining” is the process of discovering patterns in large datasets, usually in data warehouses, drawing on methods and tools from statistics, databases, machine learning and artificial intelligence. This hypothesize-model-validate-refine approach to deriving value from Big Data can be manual, with help from data analysts or specialists (data scientists), but it can also be “machine-based”, depending on adaptive machine learning. It is important to understand that, due to the velocity of the data, an algorithm for deriving value could be short-lived, and actions based on the insights may need to be implemented rather quickly.
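
The hypothesize-model-validate-refine cycle can be sketched in a few lines of Python. Everything here is synthetic and illustrative – the hypothesis, data and validation threshold are made up for demonstration:

```python
import random

# Hypothesis: y is roughly 2*x. Generate noisy synthetic data to test it.
random.seed(42)
data = [(x, 2 * x + random.uniform(-1, 1)) for x in range(100)]
train, holdout = data[:80], data[80:]   # hold out data for validation

def fit(dataset):
    # "Model": estimate the multiplier from the training data.
    return sum(y for _, y in dataset) / sum(x for x, _ in dataset)

def validate(model, dataset):
    # Mean absolute error on data the model has not seen.
    return sum(abs(y - model * x) for x, y in dataset) / len(dataset)

model = fit(train)
error = validate(model, holdout)

# If the validation error is too large, the hypothesis is refined and
# the cycle repeats; here the simple model recovers the multiplier.
print(round(model, 3), round(error, 3))
```

Real projects replace the toy multiplier with statistical or machine learning models, but the loop – propose, fit, validate on held-out data, refine – is the same.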

As an illustration, consider Google making minor changes to the algorithms that serve ads alongside its search results, collecting a large usage dataset for a limited period (maybe a few hours), and analyzing it to understand user response to specific AdWords across dimensions like demographics, geolocation, events and timing. This can give Google valuable insights into how to tweak its algorithms to serve advertising that generates the maximum revenue.
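
A toy version of that kind of slicing might look like the following sketch; the event records, segment names and click flags are invented for illustration:

```python
from collections import defaultdict

# Hypothetical ad-response events, each tagged with analysis dimensions.
events = [
    {"geo": "US", "device": "mobile",  "clicked": True},
    {"geo": "US", "device": "desktop", "clicked": False},
    {"geo": "IN", "device": "mobile",  "clicked": True},
    {"geo": "IN", "device": "mobile",  "clicked": False},
    {"geo": "US", "device": "mobile",  "clicked": True},
]

# Aggregate clicks and impressions per (geo, device) segment.
clicks = defaultdict(int)
views = defaultdict(int)
for e in events:
    key = (e["geo"], e["device"])
    views[key] += 1
    clicks[key] += e["clicked"]

# Click-through rate per segment guides where to tweak the algorithm.
ctr = {k: clicks[k] / views[k] for k in views}
print(ctr[("US", "mobile")])  # 2 clicks out of 2 impressions -> 1.0
```

At Google’s scale the same aggregation runs over billions of events on a distributed platform, but the analytical question – which segments respond to which change – is identical.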

The rise of the “Data Scientist”

While the most popular data mining example remains the myth about Walmart correlating beer purchases with diapers, there are celebrated stories about statisticians at other retailers like Target using data mining to understand customers’ buying patterns and focus marketing efforts on target customers. Organizations now understand that there is hidden value in the Big Data haystack which can be leveraged for competitive advantage. Given the amount of adaptive machine learning and automation available, it was argued even a few years ago that theory and modeling by analysts might not be needed at all – that the sheer volume of Big Data was sufficient to measure and predict patterns. However, it has slowly been understood that, due to the breadth and depth of data to be examined, and to ensure correlation between the business context and the data being analyzed, humans must play a key role in making sense of Big Data. Though not exclusively tied to Big Data projects, the data scientist is the new role envisaged for analyzing data across multiple sources and delivering insights related to business problems. In essence, a data scientist is a marriage of two roles: the business analyst and the data mining statistician. A data scientist typically has a background similar to a data analyst’s – trained in statistics, mathematics, modeling and analytics – as well as the strong business acumen and communication skills to convey highly technical analyses in business terms to both business and IT leadership. The data scientist has been variously described as everything from an awesome nerd to the holder of the sexiest job of the 21st century.

The tools for Big Data analytics

As with any new technology and its hype cycle, enterprises are careful to calibrate their involvement before joining the Big Data bandwagon. It is no wonder that the organizations which generate most of the Big Data in the public domain, e.g. social networks like Facebook or Google, also make the most use of analytics to derive value from it. Google, for example, used Big Data to discover that its famous hiring puzzles and brain-teasers were useless. Facebook currently collects 320 TB of data each day working with Hadoop and Hive, and is adding a relational data warehouse of around 250 PB. In fact, many of the open source tools and technologies in use today were developed by these organizations. They include Hadoop, an open-source Apache project that allows enormous datasets to be stored across distributed clusters of servers and distributed analysis applications to be run on those clusters. Hadoop utilizes the MapReduce parallel processing programming model, originally developed by Google to process large datasets on distributed clusters.
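
The canonical illustration of the MapReduce model is counting words. The pure-Python sketch below only mimics the map, shuffle and reduce phases in a single process – on Hadoop, the same functions would run distributed across many machines:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line of input.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key across mapper outputs.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is data"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

Because each map call touches only one line and each reduce call only one word’s counts, the framework can spread both phases across a cluster, which is what makes the model scale to the dataset sizes discussed above.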

I will explore details of Hadoop, Hive, HDFS, NoSQL, Spark, Impala and other technologies related to Big Data in the second part of this post. I will also explore the unique challenges of Big Data architecture including rapid use, rapid interpretation, lack of schema to sync across tools and other data, data quality and governance and integrating Big Data analytics into the enterprise architecture.

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL