
# The Menace of Means

A statistician is someone who has their head in the oven and their feet in the freezer, but “on average” feels fine.

It’s funny, but you may be doing the same thing inadvertently whenever you average data that are not normally distributed.

Normally distributed data (also known as Gaussian distributed) are symmetric around the mean (average). Most values fall close to the mean, and values become less frequent the further you move from it.

People intuitively understand normal distributions. They are commonly encountered when measuring things like heights (below), test scores, and weather. You can summarize normally distributed data with mean and standard deviation (a measure of width).
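To make "summarize with mean and standard deviation" concrete, here is a minimal sketch using Python's standard library. The height figures (mean 170 cm, standard deviation 10 cm) are assumed for illustration and are not taken from the chart.

```python
from statistics import NormalDist

# Hypothetical adult heights in cm; the mean and standard deviation are assumed.
heights = NormalDist(mu=170, sigma=10)

def share_within(dist, k):
    """Fraction of a normal population within k standard deviations of the mean."""
    lo = dist.mean - k * dist.stdev
    hi = dist.mean + k * dist.stdev
    return dist.cdf(hi) - dist.cdf(lo)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(heights, k):.1%}")
```

For any normal distribution the shares come out at roughly 68%, 95%, and 99.7%. That fixed relationship is what makes the standard deviation easy to interpret, and it holds only when the data really are normal.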

Some distributions are skewed because their data can’t take on negative values but are unbounded on the positive side. A count of events in some period is a common example. When the events occur independently at a steady average rate, such counts follow a Poisson distribution. The curves below are Poisson distributions for different means. As the mean moves further from zero, the distribution approaches a normal curve.
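One way to see the Poisson curve straightening out is to compute its skewness (0 for a perfectly symmetric distribution) at several means. This is a rough sketch: the function names and the chosen means are illustrative, and the sum is cut off at `max_k`, which is ample for the means used here.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is `mean`."""
    # Work in log space so large k does not overflow.
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def poisson_skewness(mean, max_k=150):
    """Skewness computed numerically from the pmf, truncated at max_k."""
    ks = range(max_k)
    probs = [poisson_pmf(k, mean) for k in ks]
    mu = sum(k * p for k, p in zip(ks, probs))
    var = sum((k - mu) ** 2 * p for k, p in zip(ks, probs))
    third = sum((k - mu) ** 3 * p for k, p in zip(ks, probs))
    return third / var ** 1.5

for m in (1, 4, 10, 25):
    print(f"mean {m:>2}: skewness {poisson_skewness(m):.2f}")
```

The skewness of a Poisson distribution equals one over the square root of its mean, so it shrinks toward zero as the mean grows. At small means the lopsided shape is too pronounced to ignore.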

Counts of sales wins and opportunity starts in a period are often Poisson distributed. Sales transaction sizes are often a combination of several Poisson distributions with different means. Here is an example of a distribution of deal sizes. The peak on the left looks very similar to the first or second Poisson curve above.

Non-normal distributions are harder to understand and can’t be summarized easily. This population has a mean of \$6K and a standard deviation of \$20K. The intuitive interpretation of standard deviation does not translate well to this data: one standard deviation below the mean would be a deal size of −\$14K, which cannot exist. Mean and standard deviation statistics on non-normally distributed data can be misleading.

One approach to this problem is to use the median, the middle value in a dataset (\$13K here). This sometimes addresses the outlier-skewing-the-mean problem. But it does not capture the breadth of the distribution, or the fact that (in the example above) there are six peaks (modes).

Mean, median, and standard deviation don’t intuitively convey the nature of non-normal distributions. You get a better sense by just looking at the chart.
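A small sketch shows how summary numbers can hide the shape. The deal sizes below are invented for illustration (they are not the data behind the chart): a cluster of small deals, a couple of mid-size deals, and one large outlier.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical deal sizes in $K, invented for illustration.
deals = [2, 3, 3, 4, 5, 5, 6, 40, 45, 150]

print(f"mean   = {mean(deals):.1f}")
print(f"median = {median(deals):.1f}")
print(f"stdev  = {stdev(deals):.1f}")

# Crude text histogram with $10K-wide bins: the clusters become visible.
bins = Counter((d // 10) * 10 for d in deals)
for start in sorted(bins):
    print(f"{start:>4}-{start + 9:<4} {'#' * bins[start]}")
```

With these made-up numbers the mean is 26.3 and the median is 5. The mean lands in a gap where no deal exists, and the median says nothing about the larger deals. The three-line histogram tells you more than either number.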

Imagine you are asked to track average deal-size (by salesperson) over time. Several problems:

1. The data are not normally distributed. Often there are multiple modes. Statistics like average, median, and standard deviation are not helpful.

2. Segmenting by salesperson reduces the data in each segment. Distributions are more skewed and variable (noisy). You can’t see trends through noise.

3. Restricting the analysis to a recent time interval shrinks each segment further and makes problem 2 worse. A longer interval reduces noise, but it turns the metric into a highly lagged indicator.
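The effect of segment size on noise can be sketched with a quick simulation. Everything here is an assumption for illustration: deal sizes are drawn from a lognormal distribution as a stand-in for right-skewed data, and the sample sizes of 200 and 10 stand in for a whole team versus a single salesperson in one period.

```python
import random
from statistics import stdev

random.seed(42)  # fixed seed so the sketch is repeatable

def draw_deal():
    """One hypothetical deal size in $K from a right-skewed (lognormal) distribution."""
    return random.lognormvariate(1.5, 1.0)

def spread_of_means(sample_size, trials=2000):
    """Standard deviation of the average deal size across many simulated periods."""
    means = []
    for _ in range(trials):
        sample = [draw_deal() for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return stdev(means)

team_spread = spread_of_means(200)  # e.g. a whole team's deals in a period
rep_spread = spread_of_means(10)    # e.g. one salesperson's deals in a period

print(f"spread of average deal size, 200 deals: {team_spread:.2f}")
print(f"spread of average deal size,  10 deals: {rep_spread:.2f}")
```

The underlying distribution never changes in this simulation, yet the per-salesperson average swings several times more from period to period than the team average. Any "trend" read off the small segments is likely to be mostly noise.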

What to do?

• Try to segment the data to find groups that are normally distributed. This is hard to do but if you have a reliable segmentation variable, use it and report accordingly.

• If you can’t find normally distributed groups, don’t report means! They’re almost meaningless.

• If people still insist on comparing different groups/salespeople, show the actual distribution for each population along with a time series showing the trend.

• If you are computing means or medians to forecast or to make decisions, you should use a Monte Carlo analysis to simulate the distribution of possible outcomes.

Monte Carlo uses random sampling to analyze and predict outcomes. It is a powerful tool for decision-making and risk assessment of complex systems (such as sales activity) that traditional methods may find difficult or impossible to analyze.
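As a rough sketch of the idea, the simulation below forecasts one quarter's revenue. All inputs are hypothetical. The number of wins is drawn from a Poisson distribution, in line with the point above about win counts, and each deal size is resampled from a made-up list of past deals. Resampling is one simple modeling choice among several.

```python
import math
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical inputs, invented for illustration.
past_deals = [2, 3, 3, 4, 5, 5, 6, 40, 45, 150]  # historical deal sizes in $K
expected_wins = 12                                # average wins per quarter

def poisson_draw(mean):
    """Draw a Poisson-distributed count using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, random.random()
    while product > limit:
        count += 1
        product *= random.random()
    return count

def simulate_quarter():
    """One simulated quarter: a random number of wins, each with a resampled size."""
    wins = poisson_draw(expected_wins)
    return sum(random.choice(past_deals) for _ in range(wins))

def percentile(sorted_values, pct):
    """Value at the given percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(pct / 100 * len(sorted_values)))
    return sorted_values[index]

outcomes = sorted(simulate_quarter() for _ in range(10_000))
p10, p50, p90 = (percentile(outcomes, p) for p in (10, 50, 90))
print(f"quarterly revenue ($K): P10={p10}, P50={p50}, P90={p90}")
```

Instead of a single "expected" number, you get a range of plausible outcomes. The gap between the 10th and 90th percentiles is the part a mean-based forecast leaves out.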

With a little skill, you can do Monte Carlo simulations in a spreadsheet as shown in this example.

Monte Carlo example (.xlsx)