Category Archives: Statistics

A Brief Introduction to Statistics – Part 3 – Statistical Inference

Statistical inference is concerned primarily with understanding the quality of parameter estimates.

Statistical Inference

The sampling distribution represents the distribution of the point estimates based on samples of a fixed size from a certain population. It is useful to think of a particular point estimate as being drawn from such a distribution. Understanding the concept of a sampling distribution is central to understanding statistical inference.

A sample statistic is a point estimate for a population parameter, e.g. the sample mean is used to estimate the population mean; the terms point estimate and sample statistic are synonymous. Point estimates (such as the sample mean) vary from one sample to another, and this variability is called sampling variability (sometimes also called sampling variation).

The standard deviation associated with an estimate is called the standard error. It describes the typical error or uncertainty associated with the estimate. Given n independent observations from a population with standard deviation σ, the standard error of the sample mean is SE = σ/sqrt(n).
Note that when the population standard deviation σ is not known (which is almost always the case), the standard error SE can be estimated using the sample standard deviation s, so that SE = s/sqrt(n).
A reliable method to ensure sample observations are independent is to conduct a simple random sample consisting of less than 10% of the population.

Difference between standard deviation and standard error
Standard deviation measures the variability in the data, while standard error measures the variability in point estimates from different samples of the same size and from the same population, i.e. measures the sampling variability. When the sample size (n) increases we would expect the sampling variability to decrease.
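The distinction can be made concrete in code. Here is a minimal sketch (Python standard library only, with made-up sample values) that estimates the standard error of the sample mean as SE = s/sqrt(n):

```python
import math

def standard_error(sample):
    """Estimate the standard error of the sample mean: SE = s / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample standard deviation s (n - 1 in the denominator)
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return s / math.sqrt(n)

# Hypothetical sample of 8 observations
sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4]
print(round(standard_error(sample), 4))
```

Note that the standard deviation s describes the spread of these 8 values, while the standard error (s divided by sqrt(8)) describes the spread of sample means across repeated samples of size 8.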

Confidence Intervals

A plausible range of values for the population parameter is called a confidence interval. A 95% confidence interval means that if we took many samples and built a confidence interval from each sample, then about 95% of those intervals would contain the actual mean, µ.
Confidence level is the percentage of random samples which yield confidence intervals that capture the true population parameter.

If the point estimate follows the normal model with standard error SE, then a confidence interval for the population parameter is: point estimate ± z* SE where z* corresponds to the confidence level selected.
In a confidence interval, z* SE is called the margin of error (corresponds to half the width of the confidence interval).
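The point estimate ± z* SE construction is a one-liner in code. A minimal sketch, using the standard z* = 1.96 for a 95% confidence level and hypothetical values for the point estimate and SE:

```python
def confidence_interval(point_estimate, se, z_star=1.96):
    """Return (lower, upper) bounds for: point estimate ± z* × SE.

    z* = 1.96 corresponds to a 95% confidence level.
    """
    margin_of_error = z_star * se  # half the width of the interval
    return point_estimate - margin_of_error, point_estimate + margin_of_error

# Hypothetical numbers: sample mean 4.9, standard error 0.25
low, high = confidence_interval(4.9, 0.25)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```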

Central Limit Theorem
If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.
Conditions for the sample mean x̄ being nearly normal and SE being accurate:
1. The sample observations are independent.
2. The sample size is large: n ≥ 30 is a good rule of thumb.
3. The distribution of sample observations is not strongly skewed.
The larger the sample size (n), the less important the shape of the distribution becomes, i.e. when n is very large the sampling distribution will be nearly normal regardless of the shape of the population distribution.
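The Central Limit Theorem can be seen empirically by simulation. The sketch below (assumed setup: an exponential population, which is strongly right-skewed with mean 1 and standard deviation 1) draws many samples of size 50 and checks that the sample means center on the population mean with spread close to σ/sqrt(n) = 1/sqrt(50) ≈ 0.141:

```python
import random
import statistics

random.seed(42)

n = 50        # sample size
reps = 2000   # number of samples drawn

# Draw 2000 samples of size 50 from an exponential(rate=1) population
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

# The sampling distribution centers on the population mean (1)
# with spread close to sigma / sqrt(n) ≈ 0.141
print(round(statistics.fmean(sample_means), 3))
print(round(statistics.stdev(sample_means), 3))
```

A histogram of `sample_means` would look nearly normal even though the underlying population is heavily skewed.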

Hypothesis Testing Framework
The null hypothesis (H0) often represents either a skeptical perspective or a claim to be tested. The alternative hypothesis (HA) represents an alternative claim under consideration and is often represented by a range of possible parameter values.
Double negatives:
In many statistical explanations, we use double negatives. For instance, we might say that the null hypothesis is not implausible or we failed to reject the null hypothesis. Double negatives are used to communicate that while we are not rejecting a position, we are also not saying it is correct.
Always construct hypotheses about population parameters (e.g. the population mean, μ) and not sample statistics (e.g. the sample mean, x̄). The population parameter is unknown, while the sample statistic is measured from the observed data, so there is no point in hypothesizing about it.
Define the null value as: the value the parameter is set to equal in the null hypothesis.
Note that the alternative hypothesis might be one-sided (μ > the null value or μ < the null value) or two-sided (μ ≠ the null value), and the choice depends on the research question.

p-value: A conditional probability to quantify the strength of the evidence against the null hypothesis and in favor of the alternative. The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true.
p-value = P(observed or more extreme sample statistic | H0 true)
The p-value quantifies how strongly the data favor HA over H0. A small p-value (less than the significance level α, usually set at 0.05) corresponds to sufficient evidence to reject H0 in favor of HA.
Note that we can never “accept” the null hypothesis since the hypothesis testing framework does not allow us to confirm it.
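When the sample mean is nearly normal, the p-value can be computed from the Z score of the observed estimate. A minimal sketch using the standard library (the numbers for x̄, the null value, and SE are hypothetical):

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_test_p_value(x_bar, mu_0, se, two_sided=True):
    """p-value for H0: mu = mu_0, given sample mean x_bar and standard error se."""
    z = (x_bar - mu_0) / se
    tail = 1 - normal_cdf(abs(z))   # probability in one tail beyond |z|
    return 2 * tail if two_sided else tail

# Hypothetical numbers: x̄ = 5.1, H0: µ = 5.0, SE = 0.04  →  Z = 2.5
p = z_test_p_value(5.1, 5.0, 0.04)
print(round(p, 4))
```

Here the two-sided p-value is about 0.012, below α = 0.05, so we would reject H0 in favor of HA.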

The conclusion of a hypothesis test might be erroneous regardless of the decision we make.
A Type 1 error is rejecting the null hypothesis when the null hypothesis is actually true.
A Type 2 error is failing to reject the null hypothesis when the alternative hypothesis is actually true.
The probability of making a Type 1 error is equal to the significance level α. Use a smaller α if a Type 1 error is relatively riskier; use a larger α if a Type 2 error is relatively riskier.

The Central Limit Theorem states that when the sample size is small, the normal approximation may not be very good. However, as the sample size becomes large, the normal approximation improves.

When to retreat
Statistical tools rely on conditions. When the conditions are not met, these tools are unreliable and drawing conclusions from them is treacherous. These conditions come in two forms:
1. The individual observations must be independent.
2. Other conditions focus on sample size and skew.
Verification of conditions for statistical tools is always necessary. We need to learn / devise new methods that are appropriate for the data, if conditions are not satisfied. It’s also important to remember that inference tools won’t be helpful when considering data that include unknown biases, such as convenience samples.


A Brief Introduction to Statistics – Part 2 – Probability and Distributions

Probability concepts form the foundation for statistics.


A formal definition of probability:
The probability of an outcome is the proportion of times the outcome would
occur if we observed the random process an infinite number of times.
This is a corollary of the law of large numbers:
As more observations are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome.

Disjoint (mutually exclusive) events are events that cannot both happen at the same time, i.e. if A and B are disjoint, P(A and B) = 0.
Complementary outcomes are mutually exclusive outcomes of the same random process whose probabilities add up to 1:
if A and B are complementary, P(A) + P(B) = 1.

If A and B are independent, then having information on A does not tell us anything about B (and vice versa).
If A and B are disjoint, then knowing that A occurs tells us that B cannot occur (and vice versa).
Disjoint (mutually exclusive) events are always dependent since if one event occurs we know the other one cannot.
A probability distribution is a list of the possible outcomes with corresponding probabilities that satisfies three rules:

  1. The outcomes listed must be disjoint.
  2. Each probability must be between 0 and 1.
  3. The probabilities must total 1.

Using the general addition rule, the probability of union of events can be calculated.
If A and B are not mutually exclusive:
P(A or B) = P(A) + P(B) − P(A and B)
If A and B are mutually exclusive:
P(A or B) = P (A) + P (B), since for mutually exclusive events P(A and B) = 0

If a probability is based on a single variable, it is a marginal probability. The
probability of outcomes for two or more variables or processes is called a joint probability.
The conditional probability of the outcome of interest A given condition B is
computed as the following:
P(A|B) = P(A and B) / P(B)
Using the multiplication rule, the probability of intersection of events can be calculated.
If A and B are independent, P(A and B) = P(A) × P(B)
If A and B are dependent, P(A and B) = P(A|B) × P(B)
The rule of complements also holds when an event and its complement are conditioned on the same information:
P(A|B) = 1 − P(A'|B), where A' is the complement of A
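These rules can be verified numerically. A minimal sketch with made-up joint probabilities for two events A and B:

```python
# Hypothetical joint probabilities for two events A and B
p_a_and_b = 0.12
p_b = 0.30

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(round(p_a_given_b, 10))

# The multiplication rule recovers the joint probability:
# P(A and B) = P(A|B) × P(B)
assert abs(p_a_given_b * p_b - p_a_and_b) < 1e-12

# Rule of complements under the same condition: P(A'|B) = 1 − P(A|B)
p_not_a_given_b = 1 - p_a_given_b
print(round(p_not_a_given_b, 10))
```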

Tree diagrams are a tool to organize outcomes and probabilities around the structure of the data. They are most useful when two or more processes occur in a sequence and each process is conditioned on its predecessors.
Bayes Theorem:
P(A1|B) = P(B|A1)P(A1) / [P(B|A1)P(A1) + P(B|A2)P(A2) + · · · + P(B|Ak)P(Ak)], where A1, A2, …, Ak represent all possible outcomes of the first variable and B is an observed outcome of the second variable; the denominator equals P(B) by the law of total probability.
Drawing a tree diagram makes it easier to understand how two variables are connected. Use Bayes’ Theorem only when there are so many scenarios that drawing a tree diagram would be complex.
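Bayes' Theorem is a few lines of code. The sketch below uses a hypothetical screening example (all rates are made up for illustration): 1% of a population has a condition (A1), and a test flags 90% of true cases and 8% of non-cases (B = positive test):

```python
def bayes(prior, likelihood):
    """Posterior P(A_i | B) from priors P(A_i) and likelihoods P(B | A_i)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)  # P(B), by the law of total probability
    return [j / total for j in joint]

prior = [0.01, 0.99]       # P(A1) = has condition, P(A2) = does not
likelihood = [0.90, 0.08]  # P(B|A1), P(B|A2)

posterior = bayes(prior, likelihood)
print(round(posterior[0], 4))  # P(condition | positive test)
```

Even with a fairly accurate test, the posterior probability of having the condition given a positive result is only about 10%, because the condition is rare; this is exactly the kind of result a tree diagram makes visible.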

The standardized (Z) score of a data point is the number of standard deviations it lies away from the mean: Z = (x − μ)/σ, where μ is the mean and σ is the standard deviation. If the tail (skew) is on the left (negative side), we have a negatively skewed distribution; the mean is pulled below the median, so the Z score of the median is positive. In a right-skewed distribution the mean exceeds the median, so the Z score of the median is negative.

A random process or variable with a numerical outcome is called a random variable, denoted by a capital letter, e.g. X. The mean of the possible outcomes of X is called the expected value, denoted by E(X).

The most common distribution is the normal curve or normal distribution. Many variables are nearly normal, but none are exactly normal. Thus the normal distribution, while not perfect for any single problem, is very useful for a variety of problems. The normal distribution with mean 0 and standard deviation 1 is called the standard normal distribution. An often-used rule of thumb is the 68-95-99.7 rule, i.e. about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean in the normal distribution, respectively.
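The 68-95-99.7 rule is easy to check empirically by simulation. A minimal sketch drawing from the standard normal distribution with Python's random module:

```python
import random

random.seed(0)
# 100,000 draws from the standard normal distribution (mean 0, sd 1)
draws = [random.gauss(0, 1) for _ in range(100_000)]

# Fraction of observations within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    frac = sum(abs(x) <= k for x in draws) / len(draws)
    print(f"within {k} sd: {frac:.3f}")
```

The three printed fractions come out close to 0.68, 0.95, and 0.997, as the rule predicts.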

A Bernoulli random variable has exactly two possible outcomes, usually labeled success(1) and failure(0). If X is a random variable that takes value 1 with probability of success p and 0 with probability 1 − p, then X is a Bernoulli random variable with:

  • mean µ = p
  • and standard deviation σ = sqrt(p(1 − p))

The binomial distribution describes the probability of having exactly k
successes in n independent Bernoulli trials with probability of a success p.
The number of possible scenarios for obtaining k successes in n trials is given by the choose function (n choose k) = n!/(k!(n − k)!)
The probability of observing exactly k successes in n independent trials is given by:
(n choose k) p^k (1 − p)^(n−k) = (n!/(k!(n − k)!)) p^k (1-p)^(n-k)
Additionally, the mean, variance, and standard deviation of the number of observed successes are:
µ = np, σ^2 = np(1 − p), σ = sqrt(np(1-p))
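The binomial formula translates directly to code using the choose function (math.comb in the standard library). A minimal sketch, using a fair-coin example for illustration:

```python
import math

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# P(exactly 3 heads in 10 fair-coin flips) = C(10,3) / 2^10 = 120/1024
print(round(binom_pmf(3, 10, 0.5), 4))

# Mean and standard deviation of the number of successes
n, p = 10, 0.5
mu = n * p                          # np
sigma = math.sqrt(n * p * (1 - p))  # sqrt(np(1 - p))
print(mu, round(sigma, 3))
```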
To check if a random variable is binomial, use the following four conditions:

  1. The trials are independent.
  2. The number of trials, n, is fixed.
  3. Each trial outcome can be classified as a success or failure.
  4. The probability of a success, p, is the same for each trial.

The binomial formula is cumbersome when the sample size (n) is large, particularly when we consider a range of observations. In some cases we may use the normal distribution as an easier and faster way to estimate binomial probabilities. A rule of thumb in such cases is to check the conditions:
np ≥ 10 and n(1−p) ≥ 10
The negative binomial distribution describes the probability of observing the k-th success on the n-th trial: (n-1 choose k-1) p^k(1-p)^(n-k) where p is the probability an individual trial is a success. All trials are assumed to be independent.
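The negative binomial formula differs from the binomial only in the choose term. A minimal sketch (fair-coin example, chosen for illustration):

```python
import math

def neg_binom_pmf(n, k, p):
    """P(the k-th success occurs on the n-th trial)."""
    return math.comb(n - 1, k - 1) * p**k * (1 - p) ** (n - k)

# Probability the 3rd success arrives exactly on the 5th trial, p = 0.5:
# C(4,2) × 0.5^3 × 0.5^2 = 6/32
print(round(neg_binom_pmf(5, 3, 0.5), 4))
```

The (n−1 choose k−1) term reflects that the last trial must be the k-th success, so only the first n−1 trials can be arranged freely.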

The Poisson distribution is often useful for estimating the number of rare events in a large population over a unit of time. Suppose we are watching for rare events and the number of observed events follows a Poisson distribution with rate λ.
P(observe k rare events) = λ^k e^(−λ) / k!
where k may take a value 0, 1, 2, and so on, and e ≈ 2.718 is the base of the natural logarithm.
A random variable may follow a Poisson distribution if the event being considered is rare, the population is large, and the events occur independently of each other.
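The Poisson formula is straightforward to implement. A minimal sketch, with a hypothetical rate of λ = 2 events per unit of time:

```python
import math

def poisson_pmf(k, lam):
    """P(observe k rare events) = lambda^k × e^(−lambda) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Hypothetical rate: on average 2 events per unit of time
print(round(poisson_pmf(0, 2.0), 4))  # probability of no events
print(round(poisson_pmf(3, 2.0), 4))  # probability of exactly 3 events
```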


A Brief Introduction to Statistics – Part 1

What is Statistics?
Collected observations are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. Each observation in data is called a case. Characteristics of the case are called variables. With a matrix/table analogy, a case is a row while a variable is a column.

Statistics – Correlation (image)

Types of variables:
Numerical– Can be discrete or continuous, and can take a wide range of numerical values.
Categorical– Specific or limited range of values, usually called levels. Variables with natural ordering of levels are called ordinal categorical variables.
A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both associated and independent.

Data collected in haphazard fashion are called anecdotal evidence. Such evidence may be true and verifiable, but it may only represent extraordinary cases.

There are two main types of scientific data collection:
Observational studies – collection of data without interfering with how the data has arisen. Can provide evidence of a naturally occurring association between variables, but by themselves, cannot show a causal connection.
Experiments – randomized experiments, usually with an explanatory variable and a response variable are performed, often with a control group.
In general, correlation does not imply causation, and causation can only be inferred from a randomized experiment.

Types of sampling:
Simple random sampling: Each subject in the population is equally likely to be selected.
Stratified sampling: The population is first divided into homogeneous strata (subjects within each stratum are similar, but different across strata) followed by random sampling from within each stratum.
Cluster sampling: The population is first divided into groups or clusters (subjects within each cluster are non-homogeneous, but clusters are similar to each other). Next a few clusters are randomly sampled followed by random sampling from within each cluster.

Randomized experiments are generally built on four principles:
Controlling – control any differences between groups for confounding variables which are known and can be accounted for.
Randomization – randomize population into groups to account for variables that cannot be controlled.
Replication – collect sufficiently large sample or replicate entire study to improve estimation.
Blocking – advanced technique of grouping population based on variable known/suspected to influence response, followed by randomizing cases within the group.
Reducing bias in experiments –
Randomized experiments are the gold standard for data collection, but they do not ensure an unbiased perspective into the cause and effect relationships in all cases. Blinding can help in overcoming placebo effect in human studies.

Distributions of a numerical variable are described by shape, center and spread. The three most commonly used measures of center and spread are:
center: mean (the arithmetic average), median (the midpoint), mode (the most frequent observation)
spread: standard deviation (variability around the mean), range (max-min), interquartile range IQR (middle 50% of the distribution)
An outlier is an observation that appears extreme relative to the rest of the data.
A robust statistic (e.g. median, IQR) is a statistic that is not heavily affected by skewness and extreme outliers.

Comparing categorical data:
A table that summarizes data for two categorical variables in this way is called a contingency table. A table for a single variable is called a frequency table. A bar plot is a common way to display a single categorical variable. A segmented bar plot is a graphical display of contingency table information. A mosaic plot is a graphical display of contingency table information that is similar to a bar plot for one variable or a segmented bar plot when using two variables. While pie charts are well known, they are not typically as useful as other charts in a data analysis.

Comparing numerical data:
The side-by-side box plot is a traditional tool for comparing across groups. Another useful plotting method uses hollow histograms to compare numerical data across groups.

Hypothesis test:
H0 Independence model – The explanatory variable has no effect on the response variable, and the difference we observed was due to chance (we just happened to observe a rare event).
HA Alternative model – The explanatory variable has an effect on the response variable, and the difference we observed was actually due to the explanatory variable's effect on the response.
Based on the simulations, we have two options:
1. We conclude that the study results do not provide strong evidence against the independence model.
2. We conclude the evidence is sufficiently strong to reject H0 and assert the alternative hypothesis.
When we conduct formal studies, usually we reject the notion that we just happened to observe a rare event. So in such a case, we reject the independence model in favor of the alternative.

Statistical Inference:
One field of statistics, statistical inference, is built on evaluating whether such differences are due to chance. In statistical inference, statisticians evaluate which model is most reasonable given the data. Errors do occur, just like rare events, and we might choose the wrong model. While we do not always choose correctly, statistical inference gives us tools to control and evaluate how often these errors occur.

SPC – Using statistics to get insight from BI

There is a well-known adage that if you keep doing the same thing and expect different results, that is a sure sign of idiocy. In the BI world too, we come across several instances where people take it for granted that the 'BI tool' will magically generate insight and spur 'intelligence' rather than 'idiocy'. Yet the very practice of reporting the same measures, or of creating reports for metrics just because the tool now makes them available, without applying any 'intelligence' to what will actually generate insight, is a major cause of BI failures. Most of the leading commercial BI products are expensive and cost a lot in maintenance and support, so it is rather important to understand how to design the proper metrics and KPIs (key performance indicators) that will generate insight. Even more important is to have a process focus and a general idea of the basics of statistical process control, in order to make sure that the right decisions are made and resources are spent on processes and strategies where they will have the most impact.

Statistical Process Control (SPC) is quite well known in the manufacturing industry and also in software engineering. In effect, it applies the rules of statistics to processes in order to predict whether a process is stable (and therefore in control) and its output predictable, and to identify out-of-control processes and take corrective measures. Quality aids like causal analysis done using brainstorming, nominal group techniques, or Ishikawa (fishbone) diagrams are helpful in analyzing outliers and the reasons for deviation from control limits. A substantive discussion of SPC and quality process areas is not possible in this post, so I'll just touch upon some concepts concisely.

PDCA – the Plan-Do-Check-Act cycle, proposed by physicist and statistician Walter A. Shewhart and later popularized by quality guru W. Edwards Deming. This is the foundation of the management and feedback cycle underlying any software engineering process.

Control limits – Any process which follows the Gaussian normal distribution would have a normal bell-shaped curve and be subject to control limits. The stability of the process can be gauged by the outliers (number and pattern of data points falling outside the control limits).

Causes of deviation: Outliers indicate deviation from a stable and predictable process. Causes of deviation could be due to special causes or common causes. Common causes are like background noise and may be present in stable processes. Special causes must be removed and steps taken to prevent their occurrence to bring a process under control. Common causes may be reduced to have a sharper curve with a narrower band of control limits and have greater control on the process.


Control Chart (Image courtesy: Wikipedia)

Users of BI tools haven't tapped into the power of SPC to gain insight and control operational processes to the extent possible. There is even a danger of damaging a stable, in-control process by tinkering with it based on common-cause variation observed in operational reports. Part of the reason SPC has not gained sufficient currency is that business analysts are not trained in the basics of SPC or quality processes like DAR (defect analysis and resolution), but mostly it is because no BI product on the market has so far allowed easy use of SPC analysis. It is only of late that vendors like SAP-BusinessObjects have come out with specific SPC modules and predictive analytics in the BI product marketplace.

BI is a specialized discipline that involves a lot of investment on the part of customers in terms of pre-sale evaluation (proofs of concept / comparisons), implementation, maintenance, and support. However, the returns from BI implementations are not easy to quantify, and ROI (return on investment) calculations can be vague and incorrect. Using SPC along with the right quality process framework helps maximize the value of BI implementations, and provides a ready reckoner for calculating ROI from projected process improvements based on statistical control limits.