Tag Archives: big data

An introduction to Data Science

I presented a talk last week introducing Data Science and associated topics to some enthusiasts.
Here’s a slide deck I created quickly with markdown using Swipe – a start-up building HTML5 presentation tools.
Here are the slides: https://www.swipe.to/2675ch


BI in the digital era

Sometime back I presented a webinar on BrightTalk. The slides for the talk have now been uploaded on Slideshare. The talk focused more on changes in digital technology disrupting businesses, the effect of Big Data, the FOMO (Fear of missing out) effect on big business – and what it meant for changes to the way we do business intelligence in the digital era.

Key themes:
* Disruption in traditional IT with cloud computing
* Changing economics and changing business models
* Rise of Big Data
* Tech changes to manage Big Data – distributed computing
* Shift from “current-state” to “next-state” questions
* Introducing Data Science
* Challenges – regulatory, data privacy
* Dangers of data science – over-fitting, interpretation
* Managing big data projects
* Data Science MOOCs (massive open online courses), tools and resources

Basics of Big Data – Building a Hadoop data warehouse

This is the 3rd part of a series of posts on Big Data. Read Part-1 (What is Big Data) and Part-2 (Hadoop).

Traditionally data warehouses have been built with relational databases as backbone. With the new challenges (3Vs) of Big Data, relational databases have been falling short of the requirements of handling

  • New data types (unstructured data)
  • Extended analytic processing
  • Throughput (TB/hour loading) with immediate query access

The industry has turned to Hadoop as a disruptive solution for these very challenges.

The new Hadoop architecture (courtesy Hortonworks):

Hadoop 2.0 with YARN

Hadoop 2.0 with YARN

Comparing the RDBMS and Hadoop data warehousing stack:

The heavy-lifting for any data warehouse is the ETL (Extract-Transform-Load) or ELT processing. The processing layers involved in the data warehousing ETL are different across conventional RDBMS and Hadoop.

Layer Conventional RDBMS Hadoop Advantages of Hadoop over conventional RDBMS
Storage Database tables HDFS file system HDFS is purpose-built for extreme IO speeds
Metadata System tables HCatalog All clients can use HCatalog to read files.
Query SQL query engine Multiple engines (SQL and non-SQL) Multiple query engines like Hive or Impala are available.

The Hadoop USP: Exploratory analytics

All 3 layers in a conventional RDBMS are glued together into a proprietary bundle unlike Hadoop where each layer is independent and separate, allowing multiple access. Apart from these 3 layers, Hadoop provides an important advantage for exploratory BI in a single step from data load to query, which is not available in conventional RDBMS.

The data-load-to-query in one step involves:

1. Copy data into HDFS with ETL tool (e.g. Informatica), Sqoop or Flume into standard HDFS files (write once). This registers the metadata with HCatalog.

2. Declare the query schema in Hive or Impala, which doesn’t require data copying or re-loading, due to the schema-on-read advantage of Hadoop compared with schema-on-write constraint in RDBMS.

3. Explore with SQL queries and launching BI tools e.g. Tableau, BusinessObjects for exploratory analytics.

Other data warehousing use cases with Hadoop:

1. High performance DW using Parquet columnar file:

The Parquet file format has a columnar storage layout with flexible compression options and its layout is optimized for queries that process large volumes of data. It is accessible to multiple query and analysis applications and can be updated and have the schema modified. A high performance DW can be created by copying raw data from raw HDFS into Parquet files which can then be queried in real-time using Hive or Cloudera Impala.

2. Platform for transforming data or ETL:

The (extract-transform-load) ETL or ELT (extract-load-transform) use case for Hadoop is well established. The reasons why Hadoop is a popular ETL platform are:

  • Hadoop itself is a general-purpose massively parallel processing (MPP) platform.
  • Hadoop’s NoSQL database – it’s flexible schema-on-read offers a relaxed alternative to the rigid strongly-typed schema-on-write model of a relational data warehouse.
  • Hadoop is highly cost-effective (up to 50 to 100 times cheaper) compared to conventional relational data warehouse on a per-Terabyte basis, due to its commodity hardware distributed architecture (low-cost scale-out) compared to the relatively high-end infrastructure required for conventional systems (high-cost scale-up).
  • Higher performance on a per-core basis (CPU processing power) allows Hadoop to beat most conventional ETL systems.
  • Most ETL vendors already market versions of their software which leverage Hadoop, whether this is through using Hadoop connectors (e.g. Oracle Data Integrator) or ETL-optimized libraries for MapReduce (e.g. Syncsort DMX-h).

Most such ETL tools allows existing developers to build Hadoop ETL without having to code or script in MapReduce with Java or Pig. Several vendors also provide ETL-on-Hadoop automation, and a rich user experience (UX) with drag-and-drop design of jobs or workflows, and even native integration taking advantage of (Yet Another Resource Negotiator) YARN in Hadoop v2.0

3. Advanced analytics:

Predictive analytics, statistics and other categories of advanced analytics produce insights that traditional BI approaches like query and reporting may find unlikely to discover. Advanced analytics platforms and tools become more important when managing big data volumes due to issues of scalability and performance.

Strategic applications, like data mining, machine learning and analytics on new types of data, including unstructured data, with much improved performance due to Hadoop’s distributed architecture also allow for new types of integrated analysis previously not possible on relational database platforms. Several newer tools and vendors support using statistical packages and languages like R (and SAS or SPSS) for big data analytics.

4. DW offloading:

Due to low storage costs (50 to 100 times less on a per-TB basis), Hadoop is well suited to keeping data online for an indefinite period of time. This is also known as a data lake or an enterprise data hub, where various types of data can be kept till newer use cases are discovered for such data. Such storage also allows for “active archival” of infrequently used data from (Enterprise Data Warehouse) EDW, thus allowing only the most valuable (and usually most recent) data to be kept in the EDW.

With ELT loads driving up to 80% of database capacity, Hadoop can also be used as a staging area for data preparation and ELT to allow offloading data processing from the EDW

With the big data deluge, and its 3 equally challenging dimensions of volume, velocity and variety – existing conventional platforms are finding it difficult to meet all of an organization’s data warehousing needs along with ETL processing times and availability SLAs. The balance of power keeps tilting towards Hadoop with newer tools and appliances extending the capabilities of Hadoop and with its superior price/performance ratio, building a data warehouse leveraging Hadoop needs to be given serious consideration.

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL

Basics of Big Data – Part 2 – Hadoop

As discussed in Part 1 of this series, Hadoop is the foremost among tools being currently used for deriving value out of Big Data. The process of gaining insights from data through Business Intelligence and analytics essentially remains the same. However, with the huge variety, volume and velocity (the 3Vs of Big Data), it’s become necessary to re-think of the data management infrastructure. Hadoop, originally designed to be used with the MapReduce algorithm to solve parallel processing constraints in distributed architectures (e.g. web indexing) of web giants like Yahoo or Google, has become the de-facto standard for Big Data (large-scale data-intensive) analytics platforms.

What is Hadoop?

Think of Hadoop as an operating system for Big Data. It is essentially a flexible and available architecture for large scale computation and data processing on a network of commodity hardware.

Conceptually, the key components of the Java-based Hadoop framework are a file store and a distributed processing system:

1. Hadoop Distributed File System (HDFS): provides reliable (fault-tolerant), scalable, low-cost storage

2. MapReduce: Batch-oriented, distributed (parallel) data processing system with resource management and scheduling

As of October 2013, the 2.x GA release of Apache Hadoop also included an enhancement – a key third component:

3. YARN: a general purpose resource management system for Hadoop to allow MapReduce and other data processing services

Hadoop architecture stack

Open-source Hadoop is an Apache project. There are however commercial distributions of Hadoop (similar to UNIX distros) most notably from Cloudera, Hortonworks, MapR, IBM, Amazon etc. The Hadoop ecosystem has several projects in development, seeking to enhance the Hadoop framework to make it more suited to performing Big Data tasks including ETL and analytics.

The key components of the Hadoop distribution:

1. Distributed file system and storage – HDFS

HDFS – a Java based file system providing scalable and reliable data storage, designed to span large clusters of commodity servers

2. Data integration – Flume, Sqoop

Flume – service for integrating large amounts of streaming data (e.g. logs) into HDFS

Sqoop – tool for transferring bulk data between Hadoop and structured databases e.g. RDBMSes

3. Data access – HBase, Hive, Pig, Impala (CDH version for interactive SQL query), Storm, MapReduce jobs in Java/Python etc.

HBase – a non-relational (NoSQL) columnar database running on top of HDFS.

Hive – a data warehouse infrastructure built on Hadoop, providing a mechanism to project structure onto the data  and query it using SQL like language – HiveQL

Pig – allows writing complex MapReduce jobs using a scripting language – PigLatin

Impala – SQL query engine running natively in Hadoop, allows querying data in HDFS and HBase. It is part of Cloudera’s CDH distribution.

Storm – provides real-time data processing capabilities to Hadoop which is traditionally batch oriented (based on MapReduce).

4. Operations– Oozie, Ambari, ZooKeeper

Oozie – Java web application used to schedule Hadoop jobs

Ambari – Framework and tools to provision, manage and monitor Hadoop clusters

ZooKeeper – provides operational services for Hadoop – e.g. distributed configuration service, named registry, synchronization service etc.

5. Resource management – YARN

YARN – separates the resource management and processing components in Hadoop 2.x which used to be done in MapReduce packages in Hadoop 1.x

A schematic of Cloudera’s Hadoop distribution (CDH) is shown below:


Why Hadoop?

Hadoop has gained immense traction in a very short amount of time and is proving useful in a range of applications, including deriving insights from Big Data analytics.

The key advantages of Hadoop as a data processing platform are:

1. Scalability and availability – Due to its ability to store and distribute extremely large datasets across hundreds of inexpensive servers operating in parallel, Hadoop offers extreme scalability. With high-availability HDFS feature in Hadoop 2.0 providing redundant namenodes for standby and failover, Hadoop now also provides high availability

2. Cost-effectiveness – Due to its design incorporating fault-tolerance and scale-out architecture, Hadoop clusters can be built with relatively inexpensive commodity hardware instead of costly blade servers, thereby providing great savings for storage and computing abilities on a per TB basis.

3. Resilience – With built-in fault tolerance, e.g. multiple copies of data replicated on cluster nodes, and with high availability HDFS in version 2.0, Hadoop provides cost-effective resilience to faults and data loss.

4. Flexibility and performance – Ability to access and store various types of data – both structured and unstructured, with no constraints of schema-on-write, along with the emergence of new ways of accessing and processing data – e.g. Storm for real-time/streaming data, SQL-like tools including Impala, Hadapt, Stinger etc.

Due to these key advantages, Hadoop lends itself to several data processing use cases. Key use cases are:

1. Data store / Enterprise data warehouse (EDW) – cost-effective storage for all of an organization’s ever expanding data

2. Active archive – allowing cost-effective querying on historical data from archival systems

3. Transformation – executing data transformations (T step of ETL/ELT) for improved throughput and performance

4. Exploration – allows fast exploration and quicker insights from new questions and use cases, taking advantage of Hadoop’s schema-on-read model instead of schema-on-write models of traditional relational databases

5. Real-time applications – usage of flexible add-ons like Storm to provide dynamic data mashups

6. Machine learning, data mining, predictive modeling and advanced statistics

The early adopters of Hadoop are the web giants like Facebook, Yahoo, Google, LinkedIn, Twitter etc.

Facebook uses Hadoop – Hive and HBase for data warehousing (over 300 PB in aggregate and over 600 TB daily data inflows) and real-time application, serving up dynamic pages customized for each of its over 1.2 billion users.

Yahoo uses Hadoop and Pig for data processing and analytics, web search, email antispam and ad serving with more than 100,000 CPUs in over 40,000 servers running Hadoop with 170 PB of storage .

Google had used MapReduce to create its web index from crawl data and also uses Hadoop clusters on its cloud platform with Google Compute Engine (GCE).

LinkedIn uses Hadoop for data storage and analytics driving personalized recommendations like “People you may know” and ad targeting.

Twitter uses Hadoop – Pig and HBase for data visualization, social graph analysis and machine learning.

Limitations of Hadoop

While Hadoop is the most well-known Big Data solution, it is just one of the components in the Big Data landscape. While in theory, Hadoop is infinitely scalable and resilient and allows a great deal of flexibility in storing structured and unstructured data, in practice, there are several considerations to be taken care of while architecting Hadoop clusters due to the inherent limitations of Hadoop.

1. Workloads – Hadoop is suitable for various types of workloads, however mixed workloads or situations where the workload may vary widely or is not known ahead, makes it difficult to optimize the Hadoop architecture.

2. Integration – Hadoop should not be a stand-alone solution, else it will quickly become a data silo unconnected with the rest of the data management infrastructure. The Hadoop strategy needs to fit into the overall data management and processing framework of the organization to allow for growth and maintenance while not sacrificing on flexibility and agility

3. Security – In the enterprise, security is a big deal. While Hadoop was originally built without a security model, the Hadoop ecosystem is evolving with various projects for security, including Kerberos authentication, the Sentry offering from Cloudera, Project Rhino from Intel, Apache Knox as reverse proxy (with contribution from Hortonworks) or using Apache Accumulo for cell-level security;  however most are complex to setup and there is still no reference standard across deployments.

4. Complexity – the complexity of Hadoop as a Big Data platform lies in its evolving ecosystem of newer technologies, with most data warehousing and analytics specialists skilled in traditional relational databases, SQL and techniques which are difficult to use on Hadoop due to the lack of tools (e.g. still evolving SQL access) and the need for additional skills including data mining or advanced statistical techniques.

5. Availability – Up until the 2.0 release, Hadoop with single-master nodes in HDFS and MapReduce was subject to single point of failure.

6. Inefficiency – HDFS is inefficient for handling small files thereby making analysis on smaller datasets extremely inefficient. This is especially painful while designing models or finding patterns on smaller datasets. MapReduce is also a batch-oriented architecture not suitable for real-time access, but this is being addressed with tools like Storm. Tools like Impala provide interactive SQL-like querying on HDFS, which helps in improving quick adhoc analysis on smaller datasets.

7. Processing framework – Not all data processing problems or analytic questions can be designed with the MapReduce framework. Hadoop is therefore ill suited for such problems which cannot be expressed as problems with Map and Reduce steps and need other data processing paradigms. There are improvements being developed with Storm for real-time access or Spark for improving the data analytics performance with in-memory distributed computing to get around these issues.

In the next parts of this series, I will explore topics of building a Hadoop data warehouse, big data analytics with tools like R as well as other Big Data solutions, Hadoop enhancements  and alternatives to Hadoop.

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL

Basics of Big Data – Part 1

You can’t miss all the buzz about Big Data! Over the past few years, the buzz around the cloud and Big Data shaping most of the future of computing, IT and analytics in particular has grown incessantly strong. As with most buzz words, which are then hijacked by marketing to suit their own products’ storylines, but which nonetheless manage to confuse users in business and staff in IT as well, Big Data means several things to several people.

So what exactly is Big Data?

Big Data refers to the enormously large datasets that are challenging to store, search, share and analyze with conventional data storage and processing systems. While these challenges remain, our ability to generate such large datasets have grown exponentially over the past few years. With the march of the digital age, and the growing popularity of social networking, the amount of data generated today is growing enormously. Not only within public domains like Google, Facebook, Youtube and Twitter but also within organizations, the amount of data being generated with more powerful servers, softwares and applications far exceeds our capacity to effectively analyze and make sense of this data.

The table below shows the growth of data and the new terminology that has been coined to address the growth of data volumes.

Amount of data



Real-world analogy

103 bytes

Kilobytes (kB)

1.44 MB High-density Floppy disk

Files, Folders

106 bytes

Megabytes (MB)

Disks, tape

Folders, Cabinets

109 bytes – 1012 bytes

Gigabytes (GB) – Terabytes (TB)

Disk arrays


1015 bytes

Petabytes (PB)


1018 bytes

Exabytes (EB)


1021 bytes

Zettabytes (ZB)


1024 bytes

Yottabytes (YB)


Volume, Velocity, Variety & Value

The 3Vs of Big Data – volume, velocity and variety have been popularized by Gartner’s analysis. Gartner defines Big Data as “high volume, velocity, and variety information assets that demand cost-effective, innovative forms of information process for enhanced insight and decision making.” What it essentially means is Big Data beyond the high volumes, moves too fast and is not always structured according to conventional database architectures. For example, multimedia content uploaded on YouTube or comments on Twitter or Facebook, coupled with the velocity at which it is generated and churned makes it obvious that this data is not in a structured format for conventional data processing. To analyze and gain insights from this data, rather derive “value”, would require a wholly different approach.

Old wine in new bottle? Data mining and Big Data Analytics

A modeling approach is required to derive value out of Big Data. A hypothesis is proposed, statistical models are created and validated or updated using data. It is interesting to note that this approach bears substantial resemblance or even overlap with “data  mining”. For those unfamiliar with the term, the more obscure and geeky part of business intelligence and information management, “data mining”  is the process of discovering patterns in large datasets, usually in data warehouses, involving several or all of methods and tools of statistics, databases, machine learning and artificial intelligence. This hypothesis-model-validate and refine approach for deriving value out of Big Data, could be manual with help from data analysts or specialists (data scientists), but could also be “machine-based” depending on adaptive machine-learning. It is important to understand due to the velocity of the data, the algorithm for deriving value could be short-lived and actions based on the insights may need to be implemented rather quickly.

As an illustration, consider minor changes done by Google in its algorithms to serve ads in its search results, collection of a large dataset based on usage for a limited period (maybe a few hours) and analyzing it to understand user response to specific Adwords across dimensions like demographics, geolocation, events, timing etc. can provide Google valuable insights on how to tweak its algorithms to serve advertising and generate the maximum revenue out of it.

The rise of the “Data Scientist”

While the most popular data mining example remains the myth about Walmart’s correlation of beer purchases with diapers, there are celebrated stories about statisticians at other retailers like Target using data mining to understand users’ buying patterns helping focus marketing efforts on target customers. Organizations now understand that there is hidden value in the Big Data haystack which can be leveraged for competitive advantage. With the amount of adaptive machine learning and automation, it used to be argued even a few years ago whether theory and modeling by analysts would be needed at all and whether the sheer volume of Big Data was sufficient to measure and predict patterns. However, it has slowly been understood that due to the breadth and depth of data to be examined, and to ensure correlation between the business context and the data being analyzed, there needs to be a key role played by humans in making sense of Big Data. Though not exclusively tied to Big Data projects, the data scientist role is the new role envisaged for analyzing data across multiple sources and delivering insights related to business problems. In essence, a data scientist is a marriage of 2 roles: the business analyst and the data mining statistician. A data scientist typically has similar backgrounds as a data analyst, being trained in statistics, mathematics, modeling and analytics as well as having strong business acumen and communication skills to convey highly technical analyses in business terms to both business and IT leadership. The data scientist has been variously cited as being an awesome nerd to having the sexiest job in the 21st century.

The tools for Big Data analytics

As with any new technology and hype cycles, enterprises are careful and calibrate their involvement in joining the Big Data bandwagon. It is no wonder that organizations which generate most of the Big Data in public domain e.g. social networks like Facebook or Google, also make the most use of analytics to derive value from Big Data. e.g. Google used Big Data to identify its famous hiring puzzles and brain-teasers were useless.  Facebook currently collects 320TB data each day working with Hadoop and Hive and is adding on a relational data warehouse of around 250PB. In fact, a lot of the open source tools and technologies in use today have been developed by these organizations. These include Hadoop which is an open-source Apache project, allowing storage of enormous datasets across distributed clusters of servers and running distributed analysis applications on the clusters. It utilizes the MapReduce parallel processing programming model, originally developed by Google to process large datasets on a cluster or distributed clusters.

I will explore details of Hadoop, Hive, HDFS, NoSQL, Spark, Impala and other technologies related to Big Data in the second part of this post. I will also explore the unique challenges of Big Data architecture including rapid use, rapid interpretation, lack of schema to sync across tools and other data, data quality and governance and integrating Big Data analytics into the enterprise architecture.

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL