Tag Archives: Hadoop

Why Spark is the big data platform of the future

Apache Spark has created a lot of buzz recently. In fact, beyond the buzz, Apache Spark has seen phenomenal adoption and has been marked out as the successor to Hadoop MapReduce.

Apache Spark

Apache Spark

Google Trends confirms the hockey stick like growth in interest in Apache Spark.  All leading Hadoop vendors, including Cloudera, now include Apache Spark in their Hadoop distribution.

GoogleTrends - Apache Spark

GoogleTrends – Interest in Apache Spark

So what exactly is Spark, and why has it generated such enthusiasm? Apache Spark is an open-source big data processing framework designed for speed and ease of use.  Spark is well-known for its in-memory performance, but that has also given rise to misconceptions about its on-disk abilities. Spark is in fact a general execution engine – which has a greatly improved performance both in-memory as well as on-disk, when compared with older frameworks like MapReduce. With its advanced DAG (directed acyclic graph) execution engine, Spark can run programs up to 100x faster than MapReduce in memory, or 10x faster on-disk. Why is Spark faster than MapReduce?

  • A key step during MapReduce operations is the synchronization or  “shuffle” step, intermediate between the “map”-step and the “reduce”-step. Apache Spark implements a sort-based shuffle design, which improves performance.
  • Apache Spark also includes a DAG (directed-acyclic graph) which allow developers to execute DAGs all at once, not step by step. This eliminates the costly synchronization required by MapReduce. Note that DAGs are also used by Storm and Tez
  • Spark supports in-memory data sharing across DAGs, so different jobs can work with the same data at a very high speed.

It’s important to remember that Hadoop is a decade-old technology, developed at a time when memory was still relatively expensive, and therefore took the design approach of persistence to disk as a way of maintaining state between execution steps. On the other hand, Spark was developed at UC Berkeley AMPLab in 2009 and then it was open-sourced in 2010 – when memory had become much cheaper. Therefore, Spark stores data in memory and transparently persists it to disk if needed, thereby achieving better performance. The core concept of Spark is this programming abstraction over data storage – called RDDs (Resilient Distributed Dataset). Under the hood, Spark automatically distributes the data contained in RDDs across the cluster and parallelizes the operations performed on them.

Word-count code in Spark

Word-count code in Spark

The end result is that, on an average – the lines of code required to develop a distributed data processing program is much less in Spark, when compared with MapReduce. See more details on why Spark is the fastest open-source engine for sorting a petabyte. Clearly, faster execution has been one of the key reasons for the uptake of Spark, but Spark also provides further advantages. Similar to YARN, the upgrade of the Hadoop framework over the MapReduce-only version, Spark allows a wide range of workloads from batch to interactive and streaming. It reduces the burden of maintaining separate tools as in Hadoop – and provides APIs in Scala, Java, Python and SQL.  Spark can run over a variety of cluster-managers, including Hadoop YARN, Apache Mesos, and Spark’s own standalone scheduler. Spark components



Spark Core – provides basic functionality of Apache Spark, including RDDs and APIs to manipulate them. Spark SQL – A new component which replaces the older Shark (SQL on Spark) project, this package provides better integration with Spark Core, it allows querying data through SQL and HiveQL and supports many data sources from Hive tables, Parquet and JSON. Spark SQL also allows developers to intermix SQL queries with the code for data manipulations with RDDs in Python, Java, and Scala. It also provides fast SQL connectivity to BI tools like Tableau or QlikView.



Spark Streaming – based on micro-batching, this component enables processing of real-time streaming data. It uses DStreams, which are series of RDDs, to process real-time data. The Spark Streaming API is very similar to the Spark Core RDD APIs, making it easy for developers to reuse and adapt code for batch to interactive or real-time applications. MLlib – provides a library of machine learning algorithms including classification, regression, clustering, and collaborative filtering, as well as model evaluation and data import. GraphX – provides an API for graphs and graph-parallel computations and operators for manipulating graphs and a library of graph algorithms. The SparkR project aims to provide a light-weight front-end to use Apache Spark from R. Work is on to integrate SparkR into Spark. Recently, Spark has introduced a dataframe library with R/Pandas syntax for use across all of the Spark language APIs and an ML pipeline API which also integrates with data frames. Spark adoption is increasing manifold, boosted by increased third-part vendor support. Databricks – the company spun out of AMPLab by the creators of Apache Spark, now provides Spark as a service on the cloud – with its own Databricks Cloud – which is in private beta. The Databricks cloud is designed to support data science in the lab as well as in the factory – by creating polyglot notebooks (mix of Scala/Java/Python/SQL possible) and building production pipelines for ETL and analytics jobs. Tableau and MemSQL have provided Spark connectors, Altiscale now provides Spark in the cloud and machine learning vendors like Nube are building products like Reifier to perform entity resolution and de-duplication using Spark. ClearStory Data provides Spark-based data processing and analytics. There is also a fledgling community of packages for Apache Spark. Big data and data science projects are complex with an increasing diverse toolset which require massive integration efforts. Greater flexibility than that provided by MapReduce, capability to support a variety of workloads and a simpler, more unified ecosystem of tools which work out of the box on a general execution engine (Apache Spark) thus provide better simplicity than the complex zoo of Hadoop MapReduce projects. Together with SparkSQL and dataframes library, Spark democratizes access to distributed data processing beyond MapReduce programmers extending it to other developers and business analysts. Over and above, considering the fast performance of Spark, it is no wonder that Apache Spark continues to gain traction and looks all set to be the default framework for Big data processing in the near future. More info:

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL


Set up a Hadoop Spark cluster in 10 minutes with Vagrant

With each of the big 3 Hadoop vendors – Cloudera, Hortonworks and MapR each providing their own Hadoop sandbox virtual machines (VMs), trying out Hadoop today has become extremely easy. For a developer, it is extremely useful to download a get started with one of these VMs and try out Hadoop to practice data science right away.

Vagrant Hadoop Spark Cluster

Set up a Hadoop-Spark cluster with Vagrant in 10 minutes

However, with the core Apache Hadoop, these vendors package their own software into their distributions, mostly for the orchestration and management, which can be a pain due to the multiple scattered open-source projects within the Hadoop ecosystem. e.g. Hortonworks includes the open-source Ambari while Cloudera includes its own Cloudera Manager for orchestrating Hadoop installations and managing multi-node clusters.

Moreover, most of these distributions require today a 64-bit machine and sometimes a high-amount of memory (for a laptop). e.g. running Cloudera Manager with a full-blown Cloudera Hadoop Distribution (CDH) 5.x requires at least 10GB RAM. For a developer with a laptop, RAM is always at a premium, hence it may seem easier to try out the vanilla Apache Hadoop downloads for installations. The documentation for Hadoop for installing a single-node cluster, and even a multi-node cluster is much improved nowadays, but with the hassles of downloading the distributions and setting up SSH, it can easily take up a long-time to effectively set up a useful multi-node cluster. The overhead of setting up and running multiple VMs can also be a challenge. The vanilla distributions also require separate installations for UI (Cloudera Hue being a nice one) and job tracking (Oozie) or orchestration (Ambari). Unfortunately Ambari works only with select versions of HDP (Hortonwork’s distribution of Hadoop), and configuring Oozie with disparate versions of Hadoop, Java and other libraries can be a real pain.

One of the solutions to this problem is to use a container-based approach to installation. Hadoop clusters can be setup with LXC (Linux containers) approach, e.g. with the very popular Docker. There are also other approaches with using Puppet, Ansible, Chef and Salt which allow easy installations. One of the simpler approaches that I tried apart from vanilla Hadoop is using Vagrant. Indeed setting up VMs with Vagrant is a breeze, and with a vagrant script (written in Ruby), setting up a multi-node cluster is very quick. In fact you can get started with a Hadoop and Spark multi-node cluster in less than 10 minutes.

Check out the project on Github – it’s adapted from Jee Vang’s excellent Vagrant project to allow for 32-bit machines,  speed-up with pre-downloads of Hadoop, Spark and Java, and includes an updated Readme with script change locations detailed.

Basics of Big Data – Building a Hadoop data warehouse

This is the 3rd part of a series of posts on Big Data. Read Part-1 (What is Big Data) and Part-2 (Hadoop).

Traditionally data warehouses have been built with relational databases as backbone. With the new challenges (3Vs) of Big Data, relational databases have been falling short of the requirements of handling

  • New data types (unstructured data)
  • Extended analytic processing
  • Throughput (TB/hour loading) with immediate query access

The industry has turned to Hadoop as a disruptive solution for these very challenges.

The new Hadoop architecture (courtesy Hortonworks):

Hadoop 2.0 with YARN

Hadoop 2.0 with YARN

Comparing the RDBMS and Hadoop data warehousing stack:

The heavy-lifting for any data warehouse is the ETL (Extract-Transform-Load) or ELT processing. The processing layers involved in the data warehousing ETL are different across conventional RDBMS and Hadoop.

Layer Conventional RDBMS Hadoop Advantages of Hadoop over conventional RDBMS
Storage Database tables HDFS file system HDFS is purpose-built for extreme IO speeds
Metadata System tables HCatalog All clients can use HCatalog to read files.
Query SQL query engine Multiple engines (SQL and non-SQL) Multiple query engines like Hive or Impala are available.

The Hadoop USP: Exploratory analytics

All 3 layers in a conventional RDBMS are glued together into a proprietary bundle unlike Hadoop where each layer is independent and separate, allowing multiple access. Apart from these 3 layers, Hadoop provides an important advantage for exploratory BI in a single step from data load to query, which is not available in conventional RDBMS.

The data-load-to-query in one step involves:

1. Copy data into HDFS with ETL tool (e.g. Informatica), Sqoop or Flume into standard HDFS files (write once). This registers the metadata with HCatalog.

2. Declare the query schema in Hive or Impala, which doesn’t require data copying or re-loading, due to the schema-on-read advantage of Hadoop compared with schema-on-write constraint in RDBMS.

3. Explore with SQL queries and launching BI tools e.g. Tableau, BusinessObjects for exploratory analytics.

Other data warehousing use cases with Hadoop:

1. High performance DW using Parquet columnar file:

The Parquet file format has a columnar storage layout with flexible compression options and its layout is optimized for queries that process large volumes of data. It is accessible to multiple query and analysis applications and can be updated and have the schema modified. A high performance DW can be created by copying raw data from raw HDFS into Parquet files which can then be queried in real-time using Hive or Cloudera Impala.

2. Platform for transforming data or ETL:

The (extract-transform-load) ETL or ELT (extract-load-transform) use case for Hadoop is well established. The reasons why Hadoop is a popular ETL platform are:

  • Hadoop itself is a general-purpose massively parallel processing (MPP) platform.
  • Hadoop’s NoSQL database – it’s flexible schema-on-read offers a relaxed alternative to the rigid strongly-typed schema-on-write model of a relational data warehouse.
  • Hadoop is highly cost-effective (up to 50 to 100 times cheaper) compared to conventional relational data warehouse on a per-Terabyte basis, due to its commodity hardware distributed architecture (low-cost scale-out) compared to the relatively high-end infrastructure required for conventional systems (high-cost scale-up).
  • Higher performance on a per-core basis (CPU processing power) allows Hadoop to beat most conventional ETL systems.
  • Most ETL vendors already market versions of their software which leverage Hadoop, whether this is through using Hadoop connectors (e.g. Oracle Data Integrator) or ETL-optimized libraries for MapReduce (e.g. Syncsort DMX-h).

Most such ETL tools allows existing developers to build Hadoop ETL without having to code or script in MapReduce with Java or Pig. Several vendors also provide ETL-on-Hadoop automation, and a rich user experience (UX) with drag-and-drop design of jobs or workflows, and even native integration taking advantage of (Yet Another Resource Negotiator) YARN in Hadoop v2.0

3. Advanced analytics:

Predictive analytics, statistics and other categories of advanced analytics produce insights that traditional BI approaches like query and reporting may find unlikely to discover. Advanced analytics platforms and tools become more important when managing big data volumes due to issues of scalability and performance.

Strategic applications, like data mining, machine learning and analytics on new types of data, including unstructured data, with much improved performance due to Hadoop’s distributed architecture also allow for new types of integrated analysis previously not possible on relational database platforms. Several newer tools and vendors support using statistical packages and languages like R (and SAS or SPSS) for big data analytics.

4. DW offloading:

Due to low storage costs (50 to 100 times less on a per-TB basis), Hadoop is well suited to keeping data online for an indefinite period of time. This is also known as a data lake or an enterprise data hub, where various types of data can be kept till newer use cases are discovered for such data. Such storage also allows for “active archival” of infrequently used data from (Enterprise Data Warehouse) EDW, thus allowing only the most valuable (and usually most recent) data to be kept in the EDW.

With ELT loads driving up to 80% of database capacity, Hadoop can also be used as a staging area for data preparation and ELT to allow offloading data processing from the EDW

With the big data deluge, and its 3 equally challenging dimensions of volume, velocity and variety – existing conventional platforms are finding it difficult to meet all of an organization’s data warehousing needs along with ETL processing times and availability SLAs. The balance of power keeps tilting towards Hadoop with newer tools and appliances extending the capabilities of Hadoop and with its superior price/performance ratio, building a data warehouse leveraging Hadoop needs to be given serious consideration.

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL

Basics of Big Data – Part 2 – Hadoop

As discussed in Part 1 of this series, Hadoop is the foremost among tools being currently used for deriving value out of Big Data. The process of gaining insights from data through Business Intelligence and analytics essentially remains the same. However, with the huge variety, volume and velocity (the 3Vs of Big Data), it’s become necessary to re-think of the data management infrastructure. Hadoop, originally designed to be used with the MapReduce algorithm to solve parallel processing constraints in distributed architectures (e.g. web indexing) of web giants like Yahoo or Google, has become the de-facto standard for Big Data (large-scale data-intensive) analytics platforms.

What is Hadoop?

Think of Hadoop as an operating system for Big Data. It is essentially a flexible and available architecture for large scale computation and data processing on a network of commodity hardware.

Conceptually, the key components of the Java-based Hadoop framework are a file store and a distributed processing system:

1. Hadoop Distributed File System (HDFS): provides reliable (fault-tolerant), scalable, low-cost storage

2. MapReduce: Batch-oriented, distributed (parallel) data processing system with resource management and scheduling

As of October 2013, the 2.x GA release of Apache Hadoop also included an enhancement – a key third component:

3. YARN: a general purpose resource management system for Hadoop to allow MapReduce and other data processing services

Hadoop architecture stack

Open-source Hadoop is an Apache project. There are however commercial distributions of Hadoop (similar to UNIX distros) most notably from Cloudera, Hortonworks, MapR, IBM, Amazon etc. The Hadoop ecosystem has several projects in development, seeking to enhance the Hadoop framework to make it more suited to performing Big Data tasks including ETL and analytics.

The key components of the Hadoop distribution:

1. Distributed file system and storage – HDFS

HDFS – a Java based file system providing scalable and reliable data storage, designed to span large clusters of commodity servers

2. Data integration – Flume, Sqoop

Flume – service for integrating large amounts of streaming data (e.g. logs) into HDFS

Sqoop – tool for transferring bulk data between Hadoop and structured databases e.g. RDBMSes

3. Data access – HBase, Hive, Pig, Impala (CDH version for interactive SQL query), Storm, MapReduce jobs in Java/Python etc.

HBase – a non-relational (NoSQL) columnar database running on top of HDFS.

Hive – a data warehouse infrastructure built on Hadoop, providing a mechanism to project structure onto the data  and query it using SQL like language – HiveQL

Pig – allows writing complex MapReduce jobs using a scripting language – PigLatin

Impala – SQL query engine running natively in Hadoop, allows querying data in HDFS and HBase. It is part of Cloudera’s CDH distribution.

Storm – provides real-time data processing capabilities to Hadoop which is traditionally batch oriented (based on MapReduce).

4. Operations– Oozie, Ambari, ZooKeeper

Oozie – Java web application used to schedule Hadoop jobs

Ambari – Framework and tools to provision, manage and monitor Hadoop clusters

ZooKeeper – provides operational services for Hadoop – e.g. distributed configuration service, named registry, synchronization service etc.

5. Resource management – YARN

YARN – separates the resource management and processing components in Hadoop 2.x which used to be done in MapReduce packages in Hadoop 1.x

A schematic of Cloudera’s Hadoop distribution (CDH) is shown below:


Why Hadoop?

Hadoop has gained immense traction in a very short amount of time and is proving useful in a range of applications, including deriving insights from Big Data analytics.

The key advantages of Hadoop as a data processing platform are:

1. Scalability and availability – Due to its ability to store and distribute extremely large datasets across hundreds of inexpensive servers operating in parallel, Hadoop offers extreme scalability. With high-availability HDFS feature in Hadoop 2.0 providing redundant namenodes for standby and failover, Hadoop now also provides high availability

2. Cost-effectiveness – Due to its design incorporating fault-tolerance and scale-out architecture, Hadoop clusters can be built with relatively inexpensive commodity hardware instead of costly blade servers, thereby providing great savings for storage and computing abilities on a per TB basis.

3. Resilience – With built-in fault tolerance, e.g. multiple copies of data replicated on cluster nodes, and with high availability HDFS in version 2.0, Hadoop provides cost-effective resilience to faults and data loss.

4. Flexibility and performance – Ability to access and store various types of data – both structured and unstructured, with no constraints of schema-on-write, along with the emergence of new ways of accessing and processing data – e.g. Storm for real-time/streaming data, SQL-like tools including Impala, Hadapt, Stinger etc.

Due to these key advantages, Hadoop lends itself to several data processing use cases. Key use cases are:

1. Data store / Enterprise data warehouse (EDW) – cost-effective storage for all of an organization’s ever expanding data

2. Active archive – allowing cost-effective querying on historical data from archival systems

3. Transformation – executing data transformations (T step of ETL/ELT) for improved throughput and performance

4. Exploration – allows fast exploration and quicker insights from new questions and use cases, taking advantage of Hadoop’s schema-on-read model instead of schema-on-write models of traditional relational databases

5. Real-time applications – usage of flexible add-ons like Storm to provide dynamic data mashups

6. Machine learning, data mining, predictive modeling and advanced statistics

The early adopters of Hadoop are the web giants like Facebook, Yahoo, Google, LinkedIn, Twitter etc.

Facebook uses Hadoop – Hive and HBase for data warehousing (over 300 PB in aggregate and over 600 TB daily data inflows) and real-time application, serving up dynamic pages customized for each of its over 1.2 billion users.

Yahoo uses Hadoop and Pig for data processing and analytics, web search, email antispam and ad serving with more than 100,000 CPUs in over 40,000 servers running Hadoop with 170 PB of storage .

Google had used MapReduce to create its web index from crawl data and also uses Hadoop clusters on its cloud platform with Google Compute Engine (GCE).

LinkedIn uses Hadoop for data storage and analytics driving personalized recommendations like “People you may know” and ad targeting.

Twitter uses Hadoop – Pig and HBase for data visualization, social graph analysis and machine learning.

Limitations of Hadoop

While Hadoop is the most well-known Big Data solution, it is just one of the components in the Big Data landscape. While in theory, Hadoop is infinitely scalable and resilient and allows a great deal of flexibility in storing structured and unstructured data, in practice, there are several considerations to be taken care of while architecting Hadoop clusters due to the inherent limitations of Hadoop.

1. Workloads – Hadoop is suitable for various types of workloads, however mixed workloads or situations where the workload may vary widely or is not known ahead, makes it difficult to optimize the Hadoop architecture.

2. Integration – Hadoop should not be a stand-alone solution, else it will quickly become a data silo unconnected with the rest of the data management infrastructure. The Hadoop strategy needs to fit into the overall data management and processing framework of the organization to allow for growth and maintenance while not sacrificing on flexibility and agility

3. Security – In the enterprise, security is a big deal. While Hadoop was originally built without a security model, the Hadoop ecosystem is evolving with various projects for security, including Kerberos authentication, the Sentry offering from Cloudera, Project Rhino from Intel, Apache Knox as reverse proxy (with contribution from Hortonworks) or using Apache Accumulo for cell-level security;  however most are complex to setup and there is still no reference standard across deployments.

4. Complexity – the complexity of Hadoop as a Big Data platform lies in its evolving ecosystem of newer technologies, with most data warehousing and analytics specialists skilled in traditional relational databases, SQL and techniques which are difficult to use on Hadoop due to the lack of tools (e.g. still evolving SQL access) and the need for additional skills including data mining or advanced statistical techniques.

5. Availability – Up until the 2.0 release, Hadoop with single-master nodes in HDFS and MapReduce was subject to single point of failure.

6. Inefficiency – HDFS is inefficient for handling small files thereby making analysis on smaller datasets extremely inefficient. This is especially painful while designing models or finding patterns on smaller datasets. MapReduce is also a batch-oriented architecture not suitable for real-time access, but this is being addressed with tools like Storm. Tools like Impala provide interactive SQL-like querying on HDFS, which helps in improving quick adhoc analysis on smaller datasets.

7. Processing framework – Not all data processing problems or analytic questions can be designed with the MapReduce framework. Hadoop is therefore ill suited for such problems which cannot be expressed as problems with Map and Reduce steps and need other data processing paradigms. There are improvements being developed with Storm for real-time access or Spark for improving the data analytics performance with in-memory distributed computing to get around these issues.

In the next parts of this series, I will explore topics of building a Hadoop data warehouse, big data analytics with tools like R as well as other Big Data solutions, Hadoop enhancements  and alternatives to Hadoop.

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL

Basics of Big Data – Part 1

You can’t miss all the buzz about Big Data! Over the past few years, the buzz around the cloud and Big Data shaping most of the future of computing, IT and analytics in particular has grown incessantly strong. As with most buzz words, which are then hijacked by marketing to suit their own products’ storylines, but which nonetheless manage to confuse users in business and staff in IT as well, Big Data means several things to several people.

So what exactly is Big Data?

Big Data refers to the enormously large datasets that are challenging to store, search, share and analyze with conventional data storage and processing systems. While these challenges remain, our ability to generate such large datasets have grown exponentially over the past few years. With the march of the digital age, and the growing popularity of social networking, the amount of data generated today is growing enormously. Not only within public domains like Google, Facebook, Youtube and Twitter but also within organizations, the amount of data being generated with more powerful servers, softwares and applications far exceeds our capacity to effectively analyze and make sense of this data.

The table below shows the growth of data and the new terminology that has been coined to address the growth of data volumes.

Amount of data



Real-world analogy

103 bytes

Kilobytes (kB)

1.44 MB High-density Floppy disk

Files, Folders

106 bytes

Megabytes (MB)

Disks, tape

Folders, Cabinets

109 bytes – 1012 bytes

Gigabytes (GB) – Terabytes (TB)

Disk arrays


1015 bytes

Petabytes (PB)


1018 bytes

Exabytes (EB)


1021 bytes

Zettabytes (ZB)


1024 bytes

Yottabytes (YB)


Volume, Velocity, Variety & Value

The 3Vs of Big Data – volume, velocity and variety have been popularized by Gartner’s analysis. Gartner defines Big Data as “high volume, velocity, and variety information assets that demand cost-effective, innovative forms of information process for enhanced insight and decision making.” What it essentially means is Big Data beyond the high volumes, moves too fast and is not always structured according to conventional database architectures. For example, multimedia content uploaded on YouTube or comments on Twitter or Facebook, coupled with the velocity at which it is generated and churned makes it obvious that this data is not in a structured format for conventional data processing. To analyze and gain insights from this data, rather derive “value”, would require a wholly different approach.

Old wine in new bottle? Data mining and Big Data Analytics

A modeling approach is required to derive value out of Big Data. A hypothesis is proposed, statistical models are created and validated or updated using data. It is interesting to note that this approach bears substantial resemblance or even overlap with “data  mining”. For those unfamiliar with the term, the more obscure and geeky part of business intelligence and information management, “data mining”  is the process of discovering patterns in large datasets, usually in data warehouses, involving several or all of methods and tools of statistics, databases, machine learning and artificial intelligence. This hypothesis-model-validate and refine approach for deriving value out of Big Data, could be manual with help from data analysts or specialists (data scientists), but could also be “machine-based” depending on adaptive machine-learning. It is important to understand due to the velocity of the data, the algorithm for deriving value could be short-lived and actions based on the insights may need to be implemented rather quickly.

As an illustration, consider minor changes done by Google in its algorithms to serve ads in its search results, collection of a large dataset based on usage for a limited period (maybe a few hours) and analyzing it to understand user response to specific Adwords across dimensions like demographics, geolocation, events, timing etc. can provide Google valuable insights on how to tweak its algorithms to serve advertising and generate the maximum revenue out of it.

The rise of the “Data Scientist”

While the most popular data mining example remains the myth about Walmart’s correlation of beer purchases with diapers, there are celebrated stories about statisticians at other retailers like Target using data mining to understand users’ buying patterns helping focus marketing efforts on target customers. Organizations now understand that there is hidden value in the Big Data haystack which can be leveraged for competitive advantage. With the amount of adaptive machine learning and automation, it used to be argued even a few years ago whether theory and modeling by analysts would be needed at all and whether the sheer volume of Big Data was sufficient to measure and predict patterns. However, it has slowly been understood that due to the breadth and depth of data to be examined, and to ensure correlation between the business context and the data being analyzed, there needs to be a key role played by humans in making sense of Big Data. Though not exclusively tied to Big Data projects, the data scientist role is the new role envisaged for analyzing data across multiple sources and delivering insights related to business problems. In essence, a data scientist is a marriage of 2 roles: the business analyst and the data mining statistician. A data scientist typically has similar backgrounds as a data analyst, being trained in statistics, mathematics, modeling and analytics as well as having strong business acumen and communication skills to convey highly technical analyses in business terms to both business and IT leadership. The data scientist has been variously cited as being an awesome nerd to having the sexiest job in the 21st century.

The tools for Big Data analytics

As with any new technology and hype cycles, enterprises are careful and calibrate their involvement in joining the Big Data bandwagon. It is no wonder that organizations which generate most of the Big Data in public domain e.g. social networks like Facebook or Google, also make the most use of analytics to derive value from Big Data. e.g. Google used Big Data to identify its famous hiring puzzles and brain-teasers were useless.  Facebook currently collects 320TB data each day working with Hadoop and Hive and is adding on a relational data warehouse of around 250PB. In fact, a lot of the open source tools and technologies in use today have been developed by these organizations. These include Hadoop which is an open-source Apache project, allowing storage of enormous datasets across distributed clusters of servers and running distributed analysis applications on the clusters. It utilizes the MapReduce parallel processing programming model, originally developed by Google to process large datasets on a cluster or distributed clusters.

I will explore details of Hadoop, Hive, HDFS, NoSQL, Spark, Impala and other technologies related to Big Data in the second part of this post. I will also explore the unique challenges of Big Data architecture including rapid use, rapid interpretation, lack of schema to sync across tools and other data, data quality and governance and integrating Big Data analytics into the enterprise architecture.

Read the series on Big Data: Part-1 : Basics, Part-2 : Hadoop, Part-3 : Hadoop data warehouse and Part-4 : NoSQL

Review of the BT Summit – Cloud computing, SOA and BI tracks

I attended the Business Technology Summit in Bangalore last week – 3rd and 4th November. There were 3 tracks on cloud computing, Service Oriented Architecture and Business Intelligence, and I chose a mix of sessions across each.

Overall impression: The BT Summit was heavily focused on cloud computing with half of second day having a deep dive into Amazon’s EC2 cloud offering, and several keynotes. SOA and web services, REST and similar architectural sessions were interspersed but definitely not a first-class citizen. BI came a poor third with a poor choice of sessions, and more of a rehash of what is out there for everyone, rather than something on the cutting-edge including use of appliances and columnar databases, as also in-memory databases and use of Flash and AJAX for interactive BI front-ends.

Session-wise review: (Speaker profiles available here). I was able to speak to and ask questions of Vinod Kumar, Vijay Doddavaram, Abhinav Agarwal and Dr. Bob Marcus.


Probably the highlight of the keynotes, this was a pep-talk about the inevitable interconnected future with smart products and services and for good measure Charney threw out some statistics on broadband growth and bandwidth usage and India’s readiness and potential in the scheme of things.

The worst of the lot – this started by comparing the spectrum of offerings in the cloud from Amazon’s DIY EC2 and AWS, Google appengine and apps to Microsoft’s Azure and ended up as a promo touting Azure as the best buy among all.

A very good keynote, focusing on what makes sense to migrate to the cloud and what doesn’t, what are the hidden costs, the myth of unlimited elasticity in the cloud and what Yahoo is doing to use open source software like Hadoop and Hive for cloud computing. In the short time span, Shouvick also tried to address some of the other considerations – including re-architecting existing applications, availability, data storage and movement considerations.

This post-lunch keynote by Sharma was a rambling talk on how technology keeps redefining our lives, and why it is important to think outside-the-box. He used the example of the iPhone to illustrate how such thinking has the potential to alter the established rules of the industry and redefine it as we know it.

Puhlmann provided the security perspective on how easy it to break/hack enterprise systems and how anti-virus and anti-spyware are always playing catch-up, the entire economy that is spawned by the “bad-guys” in technology and why our systems need to be smart and be built from the ground-up for security rather than as an afterthought. He provided valuable insights into what questions we should ask ourselves as we embrace cloud computing, the changing technology landscape making it easy for consuming information but easier still for the security breachers. Puhlmann concluded by suggesting it may be worthwhile including a level of risk assessment and mitigation, and collaboration with ethical hackers, rather than trying to do the impossible of removing all security threats.

Barely managed to sleep through it – this one talked about moving towards a virtual enterprise – with a focus on virtualized architecture, including cloud computing. As boring as they can get.

Other sessions:

  • SOA, Composite Applications, and Cloud Computing: Three pillars of a modern technology solution by Robert Schneider

Robert  Schneider presented the different facets of SOA, Composite applications (superset of mash-ups) and Cloud computing and contrasted them regarding the time to yield benefits, the maturity of the vision, involvement and buy-in from business and where they lie in the tactical-strategic plane. There wasn’t anything regarding why we are stuck with these three for a modern technology solution, or what other paradigms are out there beyond the old-world enterprise computing framework, possibly due to time constraints.

  • Self-service analysis and the future of Business Intelligence by Vinod Kumar

A lot of the BI folks were waiting for this, as Vinod performed the Project Gemini (Office 2010 Excel and PowerPivot) demo live for the first time in India, with several folks, including yours truly, sitting on the stairs. [We have had to rely on Youtube videos and MS Office 2010 preview videos earlier]. The demo was impressive fetching over 13 million records into Excel using a standard DDR laptop, using compression and in-memory technologies. The bigger question around unleashing another round of Excel hell went unanswered due to time constraints, however the presentation probably hinted at Microsoft’s vision of “self-service BI” or so-called “underground-BI” as the power-users of Excel (estimated at 2M worldwide, at 4% of the Excel user base) have been doing. Microsoft’s strategy around pushing SharePoint adoption in the Enterprise was made clear tacitly with SharePoint being the only “portal” to publish and share BI analysis (typical size of these Excel spreadsheets is upwards of 200MB) with other users in the enterprise.

  • Designing and Implementing RESTful web services by Eben Hewitt

Eben Hewitt started off with a very brief comparison between SOAP (Simple Object Access Protocol) modeled more on the lines of RPC (Remote Procedure Call) and REST (Representational State Transfer) and clarified that REST is more an architectural style rather than specifications. The remainder of the talk delved into details of implementation of REST – usage of simple ‘verbs’ and complexity in ‘nouns’, uniform interface, using named resources, java REST frameworks like Jersey, MIME types – JSON, XML, YAML and HTTP operations supported – POST, GET, PUT and DELETE.

I attended with some expectations on how a BI project can be executed possibly with open-source or free software like MySQL/Postgres, Pentaho/Talend, Jaspersoft/MicroStrategy reporting suite etc., but was highly disappointed by the presentation. Ramaswamy spoke on BI usage, barriers to BI adoption, costs of BI implementation and spewed statistics like m&m’s with cursory references to Forrester, Gartner and “research studies”, but there wasn’t anything tangible on how to go about a project execution except for some common-sense talk on “evaluating options” between open-source and licensing costs, offshoring and outsourcing, RDBMS vs. analytica databases and appliances etc.

  • Business Intelligence – Leveraging and Navigating during current challenging times by Vijay Doddavaram

Vijay spoke of the current global economic downturn and how it had taken everyone unawares during the downturn as well as when the current quarter the tide seems to have returned. With the example of a fictitious company in China, he illustrated the importance of trade-off between tactical and strategic decision making and whether and how business intelligence can make a difference in either a downturn and the upswing (whether it is a U, V, or a W curve). Thought-provoking, one couldn’t help feel that BI software has not yet eliminated the “intelligence” that people bring to the table, and made a distinct point about the “human analysis/intelligence” against the out-of-the-box actionable-intelligence marketed by the BI vendors. It would have been interesting to prolong the discussion, with a focus on the “predictive-analytics” offerings in the market (from SAP, WPC, SPSS and the open-source R etc.), we had once again run out of time, and it was the last session of the day as well.

  • Towards a unified Business Intelligence and Enterprise Performance Management Strategy by Abhinav Agarwal

Abhinav is from Oracle and he used this session to basically present the BI and EPM strategy of Oracle. Refreshing when contrasted with the usual Oracle marketing hype, Abhinav made it a point to stress the difficulty of delivering best-in-breed products due to numerous acquisitions and the inevitable integrations compared to the disruptive start-ups which could be one-trick ponies but nevertheless manage to push the technology envelope. Most of the session focused on Oracle BI server offering and the roadmap of integrating with the Fusion middleware, and brief touchpoints on the capabilities of the Oracle BI server: federated queries (acquired from nQuire, which Siebel systems had acquired, prior to being bought by Oracle), and real-time updates, including Oracle RTD (Real-time Decisions) and the segregation of the BI and EPM software offerings.

  • 10 Things software architects should know by Eben Hewitt

I was able to attend part of it, but for the most part- the bottomline of this talk was the trade-offs architects need to make and understanding there may not be a “solution” to a problem, it may just be “moving the problem” – the idea that each “solution” brings its own issues and tradeoffs into the picture. Being more focused on java APIs and cloud computing frameworks, it could have done better with something related to networks and database architecture in general for audience to relate better (for most of my time, I couldn’t relate to a BI applications and data-warehousing infrastructure).

Being late from an overcrowded dining hall, I was able to attend part of this. Bob spoke of the various public and private initiatives including those from the federal government, NASA Nebula and made the distinction early on between the types of offerings on the cloud: SaaS (Software as a service), IaaS (Infrastructure as a service) and PaaS (Platform as a Service). He mentioned in passing the data.gov and apps.gov initiatives of the Obama administration as also about RACE (Rapid Access Computing Environment) from the Dept. of Defense – Defense Information Systems Agency.

Vivek Khurana did a very short presentation to an overflowing hall on clichéd but nevertheless important aspects of information visualization while designing dashboards: clutter vs. simplicity, proper designing of KPIs, importance of delivery to mobile devices, and learning from news aggregation sites and portals on presentation.

  • Implementing Enterprise 2.0 using Open Source products by Udayan Banerjee

Banerjee did a great job of presenting what his vision of implementing Enterprise 2.0 in NIIT was – implementing SLATES (coined by Andrew McAfee) – Search, Links, Authoring, Tags, Extensions and Signals. Within half-an-hour he navigated us through using open-source products for collaboration using blogs and wiki (MediaWiki), using single-sign-on with enterprise databases, using links and tag clouds and integrating Search as well as implementing a text-based instant messenger.

I had missed the earlier session of Alan on lessons learnt using SharePoint, so I made it a point to attend the last of this at the summit – even though it meant I had no clue sometimes of what was being talked about! Alan spoke of the emergence of the multi-vendor CMIS standard for Enterprise Content Management – the various facets of ECM – from digital and media assets, email archiving, Internet content, web analytics, document types, rich media and the problems with the earlier Java standards like JSR 170 – most notably the absence of support from Microsoft. He also spoke about the vendor landscape and a 9-block rating similar to Gartner’s magic quadrant, plus various other important standards, including XAM – eXtensible Access Method – a storage standard developed by SNIA (Storage and Networking Industry Association)

Presentation files: Most presentation files are available here. You’ll need to register though to download.

– Maloy