You’re not still using cluster analysis are you? Do you wear a seatbelt when you drive?

Some shortcuts are good. A faster way to drive to work. A delicious meal in five minutes.

Some shortcuts are bad. Driving without a seatbelt. Failing to proofread. Cluster analysis. Yep. You read that right. Cluster analysis.

Cluster analysis is the main statistical tool used in market segmentation. Although it is often described as an “advanced” technique, it is actually a shortcut for something called latent class analysis (or, if you are a math geek, finite mixture modeling).

In the time before computers, when the dinosaurs roamed, some very clever people worked out the right way to use statistics to find groups in data. You may even have heard of one of them: Karl Pearson, he of Pearson’s Chi-Square Test and Pearson’s Correlation.

The problem with the “right way” was that it was, at that time, impractical. So impractical, indeed, that for most of the standard problems we face in marketing, it was not until the 1990s that it became practical to do it the “right way”, even with the fastest of fast computers and the biggest of big brains.

There are lots of different types of cluster analysis. Some were invented as shortcuts for the correct way. Some were invented by people who did not know there was a correct way. The most famous of the cluster analysis techniques, k-means, is a great case in point. It achieves its shortcut by making the following assumptions:

  1. The data is numeric and each cluster contains data that follows a multivariate normal distribution.  (And, if you don’t know what that means, it’s hard for you to know if you should be making such an assumption…)
  2. Each cluster is the same size.
  3. There is no missing data.
  4. The clusters do not overlap.
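To make these assumptions concrete, here is a minimal sketch of k-means in plain Python. The ratings and starting centers are made up for illustration, and real packages add refinements (multiple random starts, smarter initialization) that are left out here.

```python
from math import dist

def nearest(points, centers):
    # Hard assignment: every point belongs to exactly one cluster (no overlap),
    # and plain Euclidean distance treats every cluster as the same round shape.
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k])) for p in points]

def kmeans(points, centers, max_iterations=100):
    """Assign points to the nearest center, then move each center to its members' mean."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        labels = nearest(points, centers)
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Keep the old center if a cluster ends up empty.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)) if members else old)
        if updated == centers:
            break
        centers = updated
    return nearest(points, centers), centers

# Hypothetical 1-5 ratings on two questions for six respondents.
ratings = [(1, 2), (2, 1), (1, 1), (5, 4), (4, 5), (5, 5)]
labels, centers = kmeans(ratings, centers=[(1, 1), (5, 5)])
print(labels)
print(centers)
```

Notice that the code has nowhere to put a missing answer (`dist` needs a number for every question), and no way to say that a respondent is partly in one cluster and partly in another.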

As far as shortcuts go, this is of the no seatbelt variety. Just as you can drive around without a seatbelt and not die, you can use cluster analysis and often still get to where you need to go. And as with driving without a seatbelt, more often than not you will not crash. If you are taking a quick trip to the shops, generally you will survive even without your seatbelt. Similarly, if you are using k-means and you have some high quality rating scale data with few missing values, generally it will all go OK. For example:

  • Even with rating scales, such as ratings of agreement or importance, you can often get good results (even though this violates the assumption of multivariate normality).
  • Despite assuming, deep within the maths, that the clusters are the same size, k-means will still find segments that differ in size.
  • When you do have missing data there are various little tricks you can use, such as first clustering people based on the available data and then allocating the remaining people to the most similar cluster (this is what the various SPSS products do).
  • The clusters will always be found to not overlap, regardless of the “truth”.
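The missing-data trick mentioned above can be sketched in a few lines. This is only an illustration of the general idea, not the procedure any particular product uses: the cluster centers are hypothetical values standing in for the result of clustering the complete cases, and `nearest_center` is a helper name invented for this example.

```python
def nearest_center(row, centers):
    """Index of the closest center, compared only on the questions that were answered."""
    def mean_sq_diff(center):
        observed = [(x - c) ** 2 for x, c in zip(row, center) if x is not None]
        if not observed:
            raise ValueError("respondent has no observed answers")
        # Average rather than sum, so respondents who answered fewer questions
        # are not automatically "closer" to every center.
        return sum(observed) / len(observed)
    return min(range(len(centers)), key=lambda k: mean_sq_diff(centers[k]))

# Hypothetical centers, as if found by clustering the respondents with complete data.
centers = [(1.5, 2.0, 1.0), (4.5, 4.0, 5.0)]

# Respondents with missing answers (None) are allocated afterwards.
incomplete = [(None, 2, 1), (5, None, None), (4, 4, None)]
allocated = [nearest_center(row, centers) for row in incomplete]
print(allocated)
```

The weakness is visible in the first comment: the centers come only from people with complete data, so the trick has nothing to work with when nobody answered everything.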

However, when the road is rocky or you are in a monster truck rally, the seat belt becomes more important. And, when your segmentation data is “rocky” or you have a monster survey, the cluster analysis shortcut is highly dangerous, and latent class analysis is vastly preferable.  For example:

  1. If you have high levels of missing data, such as where no person has complete data, you will get an error message from cluster analysis but latent class analysis will still work.
  2. If you have weird response patterns, such as some people having a tendency to give high ratings on everything, this can be automatically addressed using the more sophisticated latent class programs like Q and Latent Gold.
  3. If you have highly unusual data, such as choice data or max-diff, then latent class analysis can still form segments, whereas cluster analysis cannot be used at all (unless you take even more shortcuts, such as computing individual-level parameters).
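To illustrate the first point, here is a rough sketch of a latent class model for yes/no questions, fitted with the EM algorithm in plain Python. It is a toy under stated assumptions (binary items, answers missing at random, made-up data and starting values), not how Q or Latent Gold implement it. Every respondent below has at least one missing answer, yet the model still fits, because each person's likelihood is computed from whichever questions they did answer.

```python
from math import prod

def fit_latent_classes(data, item_probs, class_sizes, iterations=50):
    """EM for a latent class model with binary items; None marks a missing answer."""
    item_probs = [list(p) for p in item_probs]  # work on a copy
    n_classes, n_items = len(class_sizes), len(item_probs[0])
    for _ in range(iterations):
        # E-step: each respondent's probability of belonging to each class,
        # based only on the items that respondent answered.
        posteriors = []
        for row in data:
            joint = [
                class_sizes[k] * prod(
                    (p if x == 1 else 1 - p)
                    for x, p in zip(row, item_probs[k]) if x is not None
                )
                for k in range(n_classes)
            ]
            total = sum(joint)
            posteriors.append([j / total for j in joint])
        # M-step: class sizes are estimated, not assumed equal.
        class_sizes = [sum(post[k] for post in posteriors) / len(data) for k in range(n_classes)]
        for k in range(n_classes):
            for j in range(n_items):
                seen = [(post[k], row[j]) for post, row in zip(posteriors, data) if row[j] is not None]
                weight = sum(w for w, _ in seen)
                if weight > 0:
                    estimate = sum(w * x for w, x in seen) / weight
                    # Clamp away from 0 and 1 so no answer pattern gets zero likelihood.
                    item_probs[k][j] = min(max(estimate, 0.01), 0.99)
    return posteriors, class_sizes, item_probs

# Hypothetical yes/no answers (1/0) to four questions; nobody has complete data.
answers = [
    (1, 1, None, 1), (1, None, 1, 1), (None, 1, 1, 1), (1, 1, 1, None),
    (0, 0, None, 0), (None, 0, 0, 0), (0, None, 0, 0),
]
posteriors, class_sizes, item_probs = fit_latent_classes(
    answers, item_probs=[[0.7] * 4, [0.3] * 4], class_sizes=[0.5, 0.5]
)
print([round(post[0], 2) for post in posteriors])
print([round(size, 2) for size in class_sizes])
```

The output for each respondent is a set of class membership probabilities rather than a single label, which is how the model allows segments to overlap, and the estimated class sizes are free to differ.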

In our own product, DataCracker, you will not even find a cluster analysis button. If you click Insert > Groups/Segments it will automatically create segments using latent class analysis, working out the right type of assumptions to make for your data and automatically addressing the missing values. Why not try it and see?


Image courtesy of mrpuen at