Set up a Hadoop Spark cluster in 10 minutes with Vagrant

With each of the big three Hadoop vendors (Cloudera, Hortonworks and MapR) providing its own Hadoop sandbox virtual machine (VM), trying out Hadoop today has become extremely easy. For a developer, it is very useful to download and get started with one of these VMs, try out Hadoop and practice data science right away.

Set up a Hadoop-Spark cluster with Vagrant in 10 minutes

However, on top of core Apache Hadoop, these vendors package their own software into their distributions, mostly for orchestration and management, which can otherwise be a pain because of the many scattered open-source projects within the Hadoop ecosystem. For example, Hortonworks includes the open-source Ambari, while Cloudera includes its own Cloudera Manager for orchestrating Hadoop installations and managing multi-node clusters.

Moreover, most of these distributions today require a 64-bit machine and sometimes a large amount of memory (for a laptop). For example, running Cloudera Manager with a full-blown Cloudera Hadoop Distribution (CDH) 5.x requires at least 10GB RAM. For a developer with a laptop, RAM is always at a premium, so it may seem easier to try out the vanilla Apache Hadoop downloads. The Hadoop documentation for installing a single-node cluster, and even a multi-node cluster, is much improved nowadays, but with the hassles of downloading the distributions and setting up SSH, it can easily take a long time to set up a useful multi-node cluster. The overhead of setting up and running multiple VMs can also be a challenge. The vanilla distributions also require separate installations for a UI (Cloudera Hue being a nice one), job scheduling (Oozie) or orchestration (Ambari). Unfortunately, Ambari works only with select versions of HDP (Hortonworks’ distribution of Hadoop), and configuring Oozie with disparate versions of Hadoop, Java and other libraries can be a real pain.

One solution to this problem is a container-based approach to installation. Hadoop clusters can be set up with Linux containers (LXC), e.g. with the very popular Docker. Other approaches use Puppet, Ansible, Chef or Salt, which allow easy installations. One of the simpler approaches that I tried, apart from vanilla Hadoop, is Vagrant. Setting up VMs with Vagrant is a breeze, and with a Vagrant script (written in Ruby), setting up a multi-node cluster is very quick. In fact, you can get started with a Hadoop and Spark multi-node cluster in less than 10 minutes.

Check out the project on GitHub. It is adapted from Jee Vang’s excellent Vagrant project to allow for 32-bit machines, speeds up setup with pre-downloads of Hadoop, Spark and Java, and includes an updated README that details where to change the scripts.


The data science project lifecycle

What does the typical data science project life-cycle look like?

This post looks at practical aspects of implementing data science projects. It also assumes a certain level of maturity in big data (more on big data maturity models in the next post) and data science management within the organization. Therefore the life cycle presented here differs, sometimes significantly, from purist definitions of ‘science’ which emphasize the hypothesis-testing approach. In practice, the typical data science project life-cycle resembles more of an engineering view, imposed by constraints of resources (budget, data and skills availability) and time-to-market considerations.

The CRISP-DM model (CRoss Industry Standard Process for Data Mining) has traditionally defined six steps in the data mining life-cycle. Data science is similar to data mining in several aspects, hence there’s some similarity with these steps.

CRISP-DM lifecycle

The CRISP-DM model steps are:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Given a certain level of maturity in big data and data science expertise within the organization, it is reasonable to assume availability of a library of assets related to data science implementations. Key among these are:
1. Library of business use-cases for big data/ data science applications
2. Data requirements – business use case mapping matrix
3. Minimum data quality requirements (test cases to ensure minimum level of data quality to ensure feasibility)

In most organizations, data science is a fledgling discipline, so data scientists (except those from an actuarial background) are likely to have limited business domain expertise. They therefore need to be paired with business people and with those who have expertise in understanding the data. This helps data scientists gain, or work together on, steps 1 and 2 of the CRISP-DM model, i.e. business understanding and data understanding.

The typical data science project then becomes an engineering exercise: a defined framework of steps or phases with exit criteria, which allows informed decisions on whether to continue a project based on pre-defined criteria, so as to optimize resource utilization and maximize benefits from the project. This also prevents the project from degrading into a money pit due to pursuing nonviable hypotheses and ideas.

The data science life-cycle thus looks somewhat like:
1. Data acquisition
2. Data preparation
3. Hypothesis and modeling
4. Evaluation and Interpretation
5. Deployment
6. Operations
7. Optimization

Data Science Project Life-cycle

Data acquisition – May involve acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction and transfer routines would be in place, and new sources, once identified, would be acquired following the established processes.
Data preparation – Usually referred to as “data wrangling”, this step involves cleaning the data and reshaping it into a readily usable form for performing data science. It is similar in certain aspects to the traditional ETL steps in data warehousing, but involves more exploratory analysis and is primarily aimed at extracting features in usable formats.
Hypothesis and modeling – These are the traditional data mining steps; however, in a data science project they are not limited to statistical samples. Indeed, the idea is to apply machine learning techniques to all the data. A key sub-step here is model selection. This involves separating out a training set for training the candidate machine-learning models, and validation and test sets for comparing model performance, selecting the best-performing model, gauging model accuracy and preventing over-fitting.
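
As a rough illustration of the model-selection sub-step, here is a minimal Python sketch of splitting a data set into training, validation and test sets. The function name `split_dataset` and the 60/20/20 proportions are assumptions made for this example, not something the life-cycle itself prescribes.

```python
import random


def split_dataset(rows, train_frac=0.6, validation_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test sets.

    The fractions are illustrative defaults (an assumption of this sketch);
    whatever is left after training and validation becomes the test set.
    """
    if not 0 < train_frac + validation_frac < 1:
        raise ValueError("train_frac + validation_frac must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Stand-in data: 100 record ids instead of real feature rows.
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Candidate models would be fitted on `train`, compared on `validation`, and the chosen model scored once on `test` to estimate how well it generalizes.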

Steps 2 through 4 are repeated a number of times as needed; as the understanding of data and business becomes clearer and results from initial models and hypotheses are evaluated, further tweaks are performed. These may sometimes include Step 5 (deployment) and be performed in a pre-production or “limited” / “pilot” environment before the actual full-scale “production” deployment, or could include fast-tweaks after deployment, based on the continuous deployment model.

Once the model has been deployed in production, it is time for regular maintenance and operations. This operations phase could also follow a target DevOps model which gels well with the continuous deployment model, given the rapid time-to-market requirements in big data projects. Ideally, the deployment includes performance tests to measure model performance, and can trigger alerts when the model performance degrades beyond a certain acceptable threshold.

The optimization phase is the final step in the data science project life-cycle. It could be triggered by failing performance, by the need to add new data sources and retrain the model, or by the need to deploy improved versions of the model based on better algorithms.

Agile development processes, especially continuous delivery, lend themselves well to the data science project life-cycle. As mentioned before, with increasing maturity and well-defined project goals, pre-defined performance criteria can help evaluate the feasibility of the data science project early enough in the life-cycle. This early comparison helps the data science team change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build them.


BI in the digital era

Some time back I presented a webinar on BrightTalk. The slides for the talk have now been uploaded to Slideshare. The talk focused on how changes in digital technology are disrupting businesses, the effect of Big Data, the FOMO (fear of missing out) effect on big business, and what all this means for the way we do business intelligence in the digital era.

Key themes:
* Disruption in traditional IT with cloud computing
* Changing economics and changing business models
* Rise of Big Data
* Tech changes to manage Big Data – distributed computing
* Shift from “current-state” to “next-state” questions
* Introducing Data Science
* Challenges – regulatory, data privacy
* Dangers of data science – over-fitting, interpretation
* Managing big data projects
* Data Science MOOCs (massive open online courses), tools and resources

Predictability – Get everything as a service – from IaaS, PaaS and SaaS to XaaS

The outsourcing model, which led to the “on-demand”, “as a service” model, has taken off with increasing adoption of cloud computing and mobility. What started out with SaaS (the software as a service model) has now diversified into several other services.

Indeed, cloud computing has come to rest on three of these as its core pillars:

  • SaaS: Software as a Service
  • PaaS: Platform as a Service
  • IaaS: Infrastructure as a Service

Differentiating SaaS, PaaS and IaaS:

SaaS:  Access to software applications, usually for a specific business function, delivered online, with a pricing model which is usually cheaper than on-premises licensing. Among the earliest SaaS examples were hosted CRM applications. Today SaaS has extended into several business functions, including office productivity with Microsoft’s Office365, accounting and tax from Intuit, and project management with Asana or Basecamp, while expanding into enterprise ERP with SAP or Oracle.

The benefits of SaaS include flexible pricing plans, usually with a pay-as-you-go model and removal of the need for purchasing additional hardware or additional expenses for installations, upgrades and maintenance.

PaaS:  Providing a computing platform as a service is essentially what constitutes PaaS.

The consumer is freed from having to purchase and maintain its own hardware and software stacks. The provider facilitates the deployment of applications and the develop-test-release cycle, thus enabling IT in the consumer’s organization to develop its own applications for the final customer. Given its nature, PaaS providers necessarily make choices in the mix of underlying networking, storage, servers, operating systems and middleware, and expose these as packages. Flexibility is provided with customizable options for choosing components of the application stack.

For the application development team (the consumer), all the complexity of deployment, load balancing, backups, auto-scaling etc. is managed transparently by the PaaS vendor, so the consumer can focus on building the applications/code on top of the platform.

Example: Think of Windows Azure Websites as a PaaS which provides choices for source control (Git/GitHub/CodePlex/TFS etc.) and web development stacks on .NET/Java/PHP/Python/Node.js etc. Others include Google App Engine, Pivotal CF, Heroku etc. While Amazon is not usually considered a PaaS vendor (mostly IaaS), it does have some PaaS offerings, e.g. AWS Elastic Beanstalk (free PaaS) for building web applications in Java/.NET/Python/Ruby/Docker/Node.js etc.

IaaS:  Providing the underlying infrastructure for computing systems, including networking, typically over the web using virtualization technologies, is Infrastructure as a Service. In the IaaS model, providers typically manage the networking, storage, servers and virtualization, while consumers have the flexibility to manage the operating system, middleware, application stacks and data.

The foremost example of an IaaS provider is Amazon Web Services with EC2 (Elastic Compute Cloud). With continued expansion in cloud computing, there are now several other competitors, e.g. RackSpace, Microsoft Azure (which includes both IaaS and PaaS), Google Compute Engine and more. Given its nature, the targeted customer base for IaaS providers is businesses more than consumers, so it remains mostly a B2B offer.

Differentiating SaaS, PaaS and IaaS


There’s no question that cloud computing and mobile have accelerated the adoption of the “as a service” model. While initially targeted at small and medium businesses, which looked to outsource these functions to prevent sunk costs, the “XaaS” model has rapidly moved into the enterprise space due to a rethink of the approach to total cost of ownership and the move to simpler, predictable expenses.

Today, having a cloud computing strategy is essential not only to the CIO, but the CFO as well, because of the transparency and predictable cost structures inherent in the “XaaS” models, compared to the opaque and complex financial models of old.

Historically, on-premise resources like servers, networking equipment and data centers used to be capital costs. Earlier, projects (if there were projects at all) in enterprises would propose benefits based on forecast usage, licensing, upgrade, maintenance and support costs. Quick obsolescence of technology meant that upgrades for core platforms would be major project exercises in themselves. Finding a reliably accurate total cost of ownership, or even the cost per user, would be a challenge. Enterprises sought large outsourcing deals with IT infrastructure and support providers like IBM, Fujitsu, HCL etc. essentially to arrive at predictable costs for the near future (several years, according to the duration of the contracts).

Essentially, the “XaaS” model has driven this market to develop core competence in IT delivery through the rise of IaaS, PaaS and SaaS service providers. Not only can small and medium businesses outsource their back-office and IT requirements to cloud providers, but large businesses and enterprises can also take advantage of the “XaaS” model to get ongoing predictable costs.

Today there are value-add cloud services enabled by these cloud providers, which combine the expertise of pools of skilled talent with “XaaS” offers to provide bundled services for business functions aimed at both businesses and consumers. Zoho, JustEat, Asana, BackOps and WorkDay are all examples of business services in the digital age which provide predictability.

Everything as a Service

Cloud computing and the rapid evolution of Web 2.0 technologies in the digital age have led to a host of start-ups. Typically these start-ups require a range of services: payroll, HR, IT, finance, marketing and so on. Today the “as-a-service” model has thrown up start-ups that focus on providing these services from the cloud to support other start-ups and small and medium businesses.

The “XaaS” ecosystem is not only confined to the digital world. Several brick-and-mortar services and business models are being changed and challenged by this digital world. Think of Uber, BlaBlaCar or Lyft providing taxi or rideshare services, Airbnb providing vacation rentals, Kayak or Google Flights providing travel information, HealthTap or IoraHealth providing telemedicine services or YC and Red Tree Labs providing startup-as-a-service.


Concerns remain with adoption of the cloud computing model. They range from security, availability and re-architecting applications for the cloud to lack of adequate support, dependence on the network, and high costs of storage, bandwidth and data transfer. Most of these, except data transfer costs, also apply on-premise. The key underlying concern across them is in fact the loss of control and the fear of redundancy.

For most regular usage in small and medium businesses, as well as for enterprise requirements of “Fast IT”, public cloud computing provides a better alternative. There’s no lengthy procurement process, and consumers can get started immediately. Public cloud services also provide well-designed and up-to-date service catalogs with smaller usage units compared to in-house IT.

However, in several cases where legitimate concerns around the security and sensitivity of data require additional oversight and control, private clouds with on-premise infrastructure can be better solutions.


It’s well known that private clouds prove cheaper than public clouds in most cases, not least because of the cost of data transfer (uploads to the cloud). However, newer open-source technologies like ownCloud and OpenStack, and cloud orchestration services like Cloudify, continue to reduce the barriers and costs for private and on-premise clouds.

Cloud computing provides economies of scale whether the cloud is public or private, and provides extreme flexibility in the case of public clouds. The key reason for adopting the “everything as a service” model, however, is the predictability it can offer on the costs of such services, through key metrics like cost per user, total cost of ownership and the level of accuracy of forecasts.

With the continued growth of cloud computing and mobile, and the advent of big data bolstered by cheaper bandwidth, we are likely to see a continued move towards everything as a service in the foreseeable future.

A Brief Introduction to Statistics – Part 2 – Probability and Distributions

Probability concepts form the foundation for statistics.


A formal definition of probability:
The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times.
This definition is supported by the law of large numbers:
As more observations are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome.

Disjoint (mutually exclusive) events are events that cannot both happen at the same time, i.e. if A and B are disjoint, P(A and B) = 0.
Complementary outcomes are mutually exclusive outcomes of the same random process whose probabilities add up to 1.
If A and B are complementary, P(A) + P(B) = 1

If A and B are independent, then having information on A does not tell us anything about B (and vice versa).
If A and B are disjoint, then knowing that A occurs tells us that B cannot occur (and vice versa).
Disjoint (mutually exclusive) events are always dependent since if one event occurs we know the other one cannot.

A probability distribution is a list of the possible outcomes with corresponding probabilities that satisfies three rules:

  1. The outcomes listed must be disjoint.
  2. Each probability must be between 0 and 1.
  3. The probabilities must total 1.

Using the general addition rule, the probability of union of events can be calculated.
If A and B are not mutually exclusive:
P(A or B) = P(A) + P(B) − P(A and B)
If A and B are mutually exclusive:
P(A or B) = P(A) + P(B), since for mutually exclusive events P(A and B) = 0
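
To make the addition rule concrete, here is a small Python sketch that checks it on a fair six-sided die. The die, the event names and the helper `prob` are assumptions chosen for illustration only.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (illustrative assumption).
OUTCOMES = frozenset(range(1, 7))


def prob(event):
    """P(event) when every outcome in OUTCOMES is equally likely."""
    return Fraction(len(OUTCOMES & event), len(OUTCOMES))


even = {2, 4, 6}
greater_than_three = {4, 5, 6}
one = {1}

# General addition rule: the events overlap in {4, 6}, so subtract P(A and B).
p_general = prob(even) + prob(greater_than_three) - prob(even & greater_than_three)

# Disjoint events: "even" and "one" cannot both happen, so P(A and B) = 0.
p_disjoint = prob(even) + prob(one)

print(p_general, prob(even | greater_than_three))
print(p_disjoint, prob(even | one))
```

In both cases the value from the rule matches the probability computed directly from the union of the two events.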

If a probability is based on a single variable, it is a marginal probability. The probability of outcomes for two or more variables or processes is called a joint probability.
The conditional probability of the outcome of interest A given condition B is computed as the following:
P(A|B) = P(A and B) / P(B)
Using the multiplication rule, the probability of intersection of events can be calculated.
If A and B are independent, P(A and B) = P(A) × P(B)
If A and B are dependent, P(A and B) = P(A|B) × P(B)
The rule of complements also holds when an event and its complement are conditioned on the same information:
P(A|B) = 1 − P(A’|B), where A’ is the complement of A
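
The conditional probability formula, the multiplication rule and the rule of complements can be checked on the same kind of toy example. This is a sketch under the assumption of a fair six-sided die; the events A, B and C and the helper names are hypothetical.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (illustrative assumption).
OUTCOMES = frozenset(range(1, 7))


def prob(event):
    """P(event) when every outcome in OUTCOMES is equally likely."""
    return Fraction(len(OUTCOMES & event), len(OUTCOMES))


def conditional(a, b):
    """P(A|B) = P(A and B) / P(B)."""
    return prob(a & b) / prob(b)


A = {2, 4, 6}       # roll is even
B = {4, 5, 6}       # roll is greater than three
C = {1, 2}          # roll is one or two
not_A = OUTCOMES - A

p_a_given_b = conditional(A, B)

# Multiplication rule for dependent events: P(A and B) = P(A|B) x P(B).
assert prob(A & B) == p_a_given_b * prob(B)

# A and C happen to be independent here: P(A and C) = P(A) x P(C).
assert prob(A & C) == prob(A) * prob(C)

# Rule of complements under the same condition: P(A|B) = 1 - P(A'|B).
assert p_a_given_b == 1 - conditional(not_A, B)

print(p_a_given_b)
```

Using `Fraction` keeps the arithmetic exact, so the identities can be compared with `==` rather than with a floating-point tolerance.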

Tree diagrams are a tool to organize outcomes and probabilities around the structure of the data. They are most useful when two or more processes occur in a sequence and each process is conditioned on its predecessors.
Bayes’ Theorem:
P(A1|B) = P(B|A1)P(A1) / {P(B|A1)P(A1) + P(B|A2)P(A2) + · · · + P(B|Ak)P(Ak)}, where A1, A2, A3, …, Ak represent all possible outcomes of the first variable and B is an outcome of the second variable.
Drawing a tree diagram makes it easier to understand how two variables are connected. Use Bayes’ Theorem only when there are so many scenarios that drawing a tree diagram would be complex.
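
Here is a minimal Python sketch of Bayes’ Theorem as written above. The screening-test numbers (1% prevalence, 90% detection rate, 5% false-positive rate) are made up for illustration, and `bayes_posterior` is a hypothetical helper name.

```python
from fractions import Fraction


def bayes_posterior(priors, likelihoods):
    """Return P(A_i | B) for every i, given P(A_i) and P(B | A_i).

    The denominator is P(B), obtained by summing P(B | A_i) P(A_i)
    over all outcomes A_1 ... A_k of the first variable.
    """
    joint = [prior * likelihood for prior, likelihood in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# Made-up example: A1 = "has the condition", A2 = "does not", B = "test is positive".
priors = [Fraction(1, 100), Fraction(99, 100)]       # P(A1), P(A2)
likelihoods = [Fraction(9, 10), Fraction(5, 100)]    # P(B|A1), P(B|A2)

posterior = bayes_posterior(priors, likelihoods)
print(posterior[0], float(posterior[0]))
```

With these assumed numbers, P(A1|B) comes out at 2/13, or roughly 15%: most positive results come from the much larger group without the condition.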

The standardized (Z) score of a data point is the number of standard deviations it is away from the mean: Z=(x−μ)/σ where μ=mean, and σ=standard deviation. If the tail (skew) is on the left (negative side), we have a negatively skewed distribution; the mean is pulled below the median, so the Z score of the median is positive. In a right skewed distribution the mean is pulled above the median, so the Z score of the median is negative.
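
A short Python sketch of the Z score calculation, using a small made-up sample with a long right tail. The data and the helper name `z_scores` are assumptions for illustration; the population standard deviation is used as σ.

```python
import statistics


def z_scores(values):
    """Z = (x - mean) / standard deviation, for every x in values."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


# Made-up sample: one large value (40) stretches the tail to the right.
right_skewed = [1, 2, 3, 4, 40]

mu = statistics.mean(right_skewed)
sigma = statistics.pstdev(right_skewed)
z_median = (statistics.median(right_skewed) - mu) / sigma

print([round(z, 2) for z in z_scores(right_skewed)])
# The mean (10) sits above the median (3) here, so this Z score is below zero.
print(round(z_median, 2))
```

The large value drags the mean towards the tail while the median stays put, which is what determines the sign of the median's Z score.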

A random process or variable with a numerical outcome is called a random variable, denoted by a capital letter, e.g. X. The mean of the possible outcomes of X is called the expected value, denoted by E(X).

The most common distribution is the normal curve or normal distribution. Many variables are nearly normal, but none are exactly normal. Thus the normal distribution, while not perfect for any single problem, is very useful for a variety of problems. The normal distribution with mean 0 and standard deviation 1 is called the standard normal distribution. An often-used rule of thumb is the 68-95-99.7 rule, i.e. about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean in the normal distribution, respectively.
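
The percentages for 1, 2 and 3 standard deviations can be reproduced from the standard normal distribution. The sketch below uses the identity P(|Z| < k) = erf(k/√2); the helper name `prob_within` is hypothetical.

```python
from math import erf, sqrt


def prob_within(k):
    """P(|Z| < k) for a standard normal variable Z, via the error function."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

Because any normal distribution can be standardized with a Z score, the same proportions hold for any mean and standard deviation.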

A Bernoulli random variable has exactly two possible outcomes, usually labeled success(1) and failure(0). If X is a random variable that takes value 1 with probability of success p and 0 with probability 1 − p, then X is a Bernoulli random variable with:

  • mean µ = p
  • and standard deviation σ = sqrt(p(1 − p))

The binomial distribution describes the probability of having exactly k successes in n independent Bernoulli trials with probability of a success p.
The number of possible scenarios for obtaining k successes in n trials is given by the choose function (n choose k) = n!/(k!(n − k)!)
The probability of observing exactly k successes in n independent trials is given by:
(n choose k) p^k (1 − p)^(n−k) = (n!/(k!(n − k)!)) p^k (1-p)^(n-k)
Additionally, the mean, variance, and standard deviation of the number of observed successes are:
µ = np, σ^2 = np(1 − p), σ = sqrt(np(1-p))
To check if a random variable is binomial, use the following four conditions:

  1. The trials are independent.
  2. The number of trials, n, is fixed.
  3. Each trial outcome can be classified as a success or failure.
  4. The probability of a success, p, is the same for each trial.
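
As a sketch of the binomial formula above, the following Python computes the probability of exactly k successes together with the mean, variance and standard deviation. The example of 10 fair coin flips and the function names are assumptions for illustration.

```python
from math import comb, sqrt


def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials) = (n choose k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_summary(n, p):
    """Mean, variance and standard deviation of the number of successes."""
    mean = n * p
    variance = n * p * (1 - p)
    return mean, variance, sqrt(variance)


# Illustrative assumption: 10 flips of a fair coin, "heads" counted as success.
n, p = 10, 0.5
p_five = binomial_pmf(5, n, p)
mean, variance, sd = binomial_summary(n, p)
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(p_five, mean, variance, round(sd, 3))
print(total)  # the probabilities over k = 0..n should add up to 1
```

Coin flips satisfy the four conditions listed above: the flips are independent, n is fixed, each flip is a success or a failure, and p is the same on every flip.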

The binomial formula is cumbersome when the sample size (n) is large, particularly when we consider a range of observations. In some cases we may use the normal distribution as an easier and faster way to estimate binomial probabilities. A rule of thumb to use in such cases is to check the conditions:
np ≥ 10 and n(1−p) ≥ 10
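
The sketch below compares an exact binomial probability over a range with its normal estimate. It assumes n = 100 and p = 0.5 for illustration, and it adds a continuity correction of 0.5, which is a common refinement not mentioned in the text; the function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) by summing the binomial formula for 0..k successes."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Estimate P(X <= k) with a normal curve of mean np and sd sqrt(np(1-p))."""
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("rule of thumb not met: need np >= 10 and n(1-p) >= 10")
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    # The +0.5 is a continuity correction (an addition made in this sketch).
    return NormalDist(mu, sigma).cdf(k + 0.5)


n, p, k = 100, 0.5, 55
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)
print(round(exact, 4), round(approx, 4))
```

When the two conditions hold, the estimate tracks the exact sum closely while avoiding the long sum over individual values of k.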
The negative binomial distribution describes the probability of observing the k-th success on the n-th trial: (n−1 choose k−1) p^k (1−p)^(n−k), where p is the probability that an individual trial is a success. All trials are assumed to be independent.
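
A minimal Python sketch of the negative binomial formula. The coin-flip example (second head on the third flip) and the function name are assumptions for illustration.

```python
from math import comb


def negative_binomial_pmf(n, k, p):
    """P(the k-th success occurs on the n-th trial)."""
    if n < k:
        return 0.0  # k successes need at least k trials
    return comb(n - 1, k - 1) * p**k * (1 - p) ** (n - k)


# Illustrative assumption: fair coin, probability the 2nd head lands on flip 3.
p_second_head_on_third = negative_binomial_pmf(3, 2, 0.5)

# Summed over all possible n, the probabilities should approach 1.
total = sum(negative_binomial_pmf(n, 2, 0.5) for n in range(2, 200))

print(p_second_head_on_third, total)
```

The last trial must be a success, which is why only the first n−1 trials are counted in the choose term.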

The Poisson distribution is often useful for estimating the number of rare events in a large population over a unit of time. Suppose we are watching for rare events and the number of observed events follows a Poisson distribution with rate λ.
P(observe k rare events) = λ^k e^(−λ) / k!
where k may take a value 0, 1, 2, and so on, and e ≈ 2.718 is the base of the natural logarithm.
A random variable may follow a Poisson distribution if the event being considered is rare, the population is large, and the events occur independently of each other.
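
The Poisson formula translates directly into a few lines of Python. The rate λ = 2 events per unit of time is a made-up value for illustration, and `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(observe k rare events) = lam^k e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)


lam = 2.0  # assumed rate: two rare events per unit of time
probabilities = [poisson_pmf(k, lam) for k in range(50)]

p_none = probabilities[0]
total = sum(probabilities)
mean = sum(k * pk for k, pk in enumerate(probabilities))

print(round(p_none, 4), round(total, 6), round(mean, 6))
```

Truncating at k = 49 loses a negligible amount of probability for this rate, so the total is effectively 1 and the computed mean is effectively the rate λ.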


Now on Amazon – download the BIguru blog app!

The BIguru BI Blog app is now available on the Amazon AppStore!

To search and download the app, go to the Amazon AppStore and search for “Biguru BI Blog”.
To download and install, you’ll need to follow instructions for your Android smartphone, i.e. you’ll need to “enable unknown sources” as outlined by Amazon.

BIguru BI Blog app

Once you’ve downloaded and installed it by accepting the defaults (your smartphone anti-virus should scan the app after installation), you’re free to get updates on new posts from this blog!

The app is powered by the Como App Maker – which makes it simple to create HTML5 apps for all types of smartphones.

Enjoy your app!

A Brief Introduction to Statistics – Part 1

What is Statistics?
Collected observations are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. Each observation in data is called a case. Characteristics of the case are called variables. With a matrix/table analogy, a case is a row while a variable is a column.

Statistics – Correlation (Courtesy:

Types of variables:
Numerical– Can be discrete or continuous, and can take a wide range of numerical values.
Categorical– Specific or limited range of values, usually called levels. Variables with natural ordering of levels are called ordinal categorical variables.
A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both associated and independent.

Data collected in haphazard fashion are called anecdotal evidence. Such evidence may be true and verifiable, but it may only represent extraordinary cases.

There are two main types of scientific data collection:
Observational studies – collection of data without interfering with how the data has arisen. Can provide evidence of a naturally occurring association between variables, but by themselves, cannot show a causal connection.
Experiments – randomized experiments, usually with an explanatory variable and a response variable are performed, often with a control group.
In general, correlation does not imply causation, and causation can only be inferred from a randomized experiment.

Types of sampling:
Simple random sampling: Each subject in the population is equally likely to be selected.
Stratified sampling: The population is first divided into homogeneous strata (subjects within each stratum are similar, but different across strata) followed by random sampling from within each stratum.
Cluster sampling: The population is first divided into groups or clusters (subjects within each cluster are non-homogeneous, but clusters are similar to each other). Next a few clusters are randomly sampled followed by random sampling from within each cluster.
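
The three sampling schemes can be sketched with Python's standard `random` module. The population of 200 made-up subjects, the `region` and `school` attributes and all function names are assumptions for illustration.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, rng):
    """Every subject is equally likely to be selected."""
    return rng.sample(population, n)


def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Divide into strata, then sample randomly within every stratum."""
    strata = defaultdict(list)
    for subject in population:
        strata[stratum_of(subject)].append(subject)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(population, cluster_of, n_clusters, n_per_cluster, rng):
    """Pick a few clusters at random, then sample within each chosen cluster."""
    clusters = defaultdict(list)
    for subject in population:
        clusters[cluster_of(subject)].append(subject)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for key in chosen:
        sample.extend(rng.sample(clusters[key], n_per_cluster))
    return sample


rng = random.Random(0)
# Made-up population: 4 regions (strata) and 10 schools (clusters).
population = [{"id": i, "region": i % 4, "school": i % 10} for i in range(200)]

srs = simple_random_sample(population, 20, rng)
strat = stratified_sample(population, lambda s: s["region"], 5, rng)
clus = cluster_sample(population, lambda s: s["school"], 3, 5, rng)
print(len(srs), len(strat), len(clus))
```

Stratified sampling draws from every stratum, while cluster sampling draws only from the few clusters that were selected, which is why the latter is often cheaper in practice.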

Randomized experiments are generally built on four principles:
Controlling – control any differences between groups for confounding variables which are known and can be accounted for.
Randomization – randomize population into groups to account for variables that cannot be controlled.
Replication – collect sufficiently large sample or replicate entire study to improve estimation.
Blocking – advanced technique of grouping population based on variable known/suspected to influence response, followed by randomizing cases within the group.
Reducing bias in experiments –
Randomized experiments are the gold standard for data collection, but they do not ensure an unbiased perspective into the cause and effect relationships in all cases. Blinding can help in overcoming placebo effect in human studies.

Distributions of a numerical variable are described by shape, center and spread. The three most commonly used measures of center and spread are:
center: mean (the arithmetic average), median (the midpoint), mode (the most frequent observation)
spread: standard deviation (variability around the mean), range (max-min), interquartile range IQR (middle 50% of the distribution)
An outlier is an observation that appears extreme relative to the rest of the data.
A robust statistic (e.g. median, IQR) is a statistic that is not heavily affected by skewness and extreme outliers.
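
To see why the median and IQR are called robust, the sketch below computes the measures of center and spread for a small made-up sample, before and after adding one extreme outlier. The data and the helper name `describe` are assumptions for illustration.

```python
import statistics


def describe(values):
    """Common measures of center and spread for a numerical variable."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }


clean = [2, 3, 3, 4, 5, 6, 7]      # made-up observations
with_outlier = clean + [100]       # same data plus one extreme value

before = describe(clean)
after = describe(with_outlier)

for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The mean, standard deviation and range move a lot when the outlier is added, while the median, mode and IQR barely change.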

Comparing categorical data:
A table that summarizes data for two categorical variables is called a contingency table. A table for a single variable is called a frequency table. A bar plot is a common way to display a single categorical variable. A segmented bar plot is a graphical display of contingency table information. A mosaic plot is a graphical display of contingency table information that is similar to a bar plot for one variable or a segmented bar plot when using two variables. While pie charts are well known, they are not typically as useful as other charts in a data analysis.

Comparing numerical data:
The side-by-side box plot is a traditional tool for comparing across groups. Another useful plotting method uses hollow histograms to compare numerical data across groups.

Hypothesis test:
H0 Independence model – The explanatory variable has no effect on the response variable, and we observed a difference that would only happen rarely.
HA Alternative model – The explanatory variable has an effect on the response variable, and the difference we observed is explained by that effect.
Based on the simulations, we have two options:
1. We conclude that the study results do not provide strong evidence against the independence model.
2. We conclude the evidence is sufficiently strong to reject H0 and assert the alternative hypothesis.
When we conduct formal studies, usually we reject the notion that we just happened to observe a rare event. So in such a case, we reject the independence model in favor of the alternative.
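
One way to run the kind of simulation referred to above is a permutation test: shuffle the group labels many times and count how often a difference at least as large as the observed one appears by chance. This is a sketch under assumptions; the outcome data (1 = success, 0 = failure), the group sizes and the function name are made up for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_simulations=5000, seed=0):
    """Share of label shuffles with a difference in means >= the observed one."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_simulations):
        # Under the independence model the group labels are interchangeable.
        rng.shuffle(pooled)
        simulated = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if simulated >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / n_simulations


# Made-up study: 18 of 20 successes with treatment, 8 of 20 in the control group.
treatment = [1] * 18 + [0] * 2
control = [1] * 8 + [0] * 12

p_value = permutation_p_value(treatment, control)
print(p_value)
```

A small proportion means the observed difference would be rare under the independence model, which is the situation in which we reject H0 in favor of the alternative.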

Statistical Inference:
One field of statistics, statistical inference, is built on evaluating whether such differences are due to chance. In statistical inference, statisticians evaluate which model is most reasonable given the data. Errors do occur, just like rare events, and we might choose the wrong model. While we do not always choose correctly, statistical inference gives us tools to control and evaluate how often these errors occur.