If you look hard enough, surveys are filled with “insights”. There are always differences that could be interesting. Men may like your product more than women. Older people may buy your product less frequently. Awareness may lift when you rejig your website.

The challenge with analyzing surveys is that many things that look like insights are probably wrong. Statistical testing has been developed for just this problem. The purpose of a “stat test” is to work out if a result in a survey that seems to be interesting is likely to be a one-off fluke, or, is a true insight.

Consider a survey where you find that 30% of women like your product and 40% of men like your product. Perhaps the 10-percentage-point higher score amongst the men signifies that they are more interested. Or, it may be just a random fluke caused by how people were selected to participate in your survey, and if you did the survey again perhaps you would find that women like your product more than men. (Statisticians refer to this idea as *sampling error*.)

Statisticians have developed hundreds of different statistical tests which are designed precisely to solve this problem. The key output of these tests is something called a *p*-value, which is a number between 0 and 1. Larger *p*-values mean that the results are more likely to be flukes. For example, if we found with our comparison of 30% to 40% that the *p*-value was 0.20, this would mean that *if* it is true that men and women have the same interest in our product, there is a 0.20 probability (i.e., a 20% or 1 in 5 chance) that we would have observed a difference of 10 percentage points or more (i.e., 40% – 30%).
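To make this concrete, here is a minimal sketch of one such test, a two-proportion z-test, in Python. The sample sizes (200 men and 200 women) are hypothetical, as the example above does not state them, and the normal approximation used here is just one of many ways such a test can be computed.

```python
import math

def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value for the difference between two sample
    proportions, using a pooled z-test and the normal approximation."""
    # Pool the two proportions under the assumption of no real difference
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical survey: 200 men (40% like it) vs 200 women (30% like it)
p = two_proportion_p_value(0.40, 200, 0.30, 200)
print(round(p, 3))
```

A small *p*-value here would suggest that a 10-percentage-point gap is unlikely to arise from sampling error alone at these sample sizes; with smaller samples, the same gap would produce a much larger *p*-value.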

A slightly tricky thing about significance is that the logic is back-to-front. A stat test cannot tell us whether our result is actually true or not. All it can tell us is how likely it is that, if there were in reality no real difference, we would have observed a result like the one that we did actually observe.

Putting aside the mind-boggling concept of *p*-values, the practical implications that most people take out of stat testing are as follows:

- When we compare differences in results in surveys, we can use stat tests to check if the differences are likely to be flukes or to reflect real differences in the world.
- Where the *p*-values are small, the results are more likely to be valid (i.e., not flukes). Alternatively, some people use the term *confidence*, and seek results with high levels of confidence, where the confidence level is just (1 – p) * 100%. For example, a *p*-value of 0.10 is the same as a confidence level of 90%.

- You should do your testing using a small *p*-value. The standard practice is to choose a particular threshold level and conclude that results are “significant” if the computed *p*-values are smaller than this threshold. For example, most scientific research requires a *p*-value of less than 0.05, and some medical trials require a *p*-value much, much smaller than this. Some market research companies use a threshold of 0.10.

As things go, such a pragmatic understanding of stat testing is OK. And, when I write “OK” I mean that it is a lot better than nothing, but it is not really good practice. The problem with this pragmatic and widespread approach to stat testing is that it leads to lots of *false discoveries*.

A false discovery is a conclusion that is false. Imagine that the truth is that if we talked to everybody in the world we would discover that men and women have the same level of preference for our product. But, we do a survey of a few hundred people and “discover” that men have a 10-percentage-point higher level of liking than women. Such a “discovery” is a false discovery; that is, what we think we have discovered turns out not to be true.

Now we come to the rub. If you do stat testing in the standard pragmatic way, it is possible that many of your discoveries, and perhaps even most of your discoveries, are actually false discoveries and will mislead you about how your market works. There are two ways of protecting against this. One is to use a lot of common sense. If a result seems a bit weird, it usually is a false discovery. The other approach is to replace the traditional method for statistical testing with methods that explicitly focus on reducing the number of false discoveries. For example, some of the more modern survey analysis programs, such as Q and Google Consumer Surveys, automatically correct their results to ensure that no more than 5% of the results that are reported as being significant are false discoveries.
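To illustrate how this kind of false-discovery correction can work, here is a sketch of the Benjamini–Hochberg procedure, one standard method for controlling the expected proportion of false discoveries. The *p*-values below are made up for illustration, and this is not necessarily the exact algorithm used by any particular product.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of p-values judged significant while
    controlling the expected false discovery rate at level q."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * q:
            cutoff = rank
    # Everything up to and including that rank is declared significant
    return sorted(order[:cutoff])

# Hypothetical p-values from ten comparisons in one survey
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print(benjamini_hochberg(pvals, q=0.05))  # prints [0, 1]
```

With a naive per-test threshold of 0.05, the first five of these comparisons would all be called “significant”; the false-discovery-controlled procedure reports only the first two, which is exactly the kind of pruning that protects you from chasing flukes.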

In our own product, DataCracker, we use the traditional pragmatic approach as the default. However, you can instead have it ensure that no more than five percent of results are false discoveries by clicking on **Highlight Results** and selecting **Highlight few results**.

If you want to know how this is done, please take a look at http://surveyanalysis.org/wiki/Multiple_Comparisons_(Post_Hoc_Testing).

*Image courtesy of Stuart Miles / FreeDigitalPhotos.net*