Thursday, November 3, 2011

Cargo Cult Statistics

One of the nice things about working in a world-class ecology group is the statistical rigor with which ecologists analyse their results. Unfortunately, this rigor is often missing in computational intelligence. Although I touched on some of these issues in a previous post on Minimum requirements for computational intelligence papers, I recently read an article (which shall remain anonymous) that actually made me groan. While I am starting to notice more papers with repeated trials, and even some that investigate several parameters, the analysis of these results leaves a lot to be desired.

Sometimes it is enough to simply list the mean and standard deviation of your accuracy measures. By itself, the mean is useful as a statistic that represents the population of accuracies that the algorithm yielded. The standard deviation is also good as a measure of the spread of the values. But if your standard deviation is large, the paper needs some comment on why the algorithm is so variable. This is even more important when comparing different algorithms. An author might, for example, like to say that a neural network trained with evolutionary programming is better than logistic regression for their application, but if they are seeing a coefficient of variation (the standard deviation divided by the mean) of more than 60%, then that implies that the algorithm is giving highly variable or even inconsistent results. To say that these results show that ANNs are better than regression, without any statistical tests for significant differences, is simply nonsense.
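As a rough illustration of the summary statistics discussed above, here is a minimal Python sketch using only the standard library. The accuracy values and the names `summarise`, `ann_accuracies` and `logreg_accuracies` are invented for the example; they are not taken from any real paper or experiment.

```python
import statistics


def summarise(accuracies):
    """Return the mean, sample standard deviation and coefficient of variation (%)."""
    mean = statistics.mean(accuracies)
    sd = statistics.stdev(accuracies)  # sample SD (n - 1 denominator)
    cv = 100.0 * sd / mean             # only meaningful when the mean is positive
    return mean, sd, cv


# Hypothetical accuracies from ten repeated trials of each algorithm.
ann_accuracies = [0.91, 0.42, 0.88, 0.35, 0.93, 0.20, 0.90, 0.15, 0.89, 0.30]
logreg_accuracies = [0.71, 0.69, 0.72, 0.70, 0.68, 0.73, 0.70, 0.71, 0.69, 0.72]

for name, accuracies in [("ANN", ann_accuracies),
                         ("logistic regression", logreg_accuracies)]:
    mean, sd, cv = summarise(accuracies)
    print(f"{name}: mean={mean:.3f}  sd={sd:.3f}  cv={cv:.1f}%")
```

In made-up data like this, the first algorithm has some excellent runs but a very large coefficient of variation, which is exactly the kind of result that needs explaining before any claim about which algorithm is better.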

Even if you do carry out such tests, you need to make sure that you are using the correct ones. What is the distribution of your results? Are they normally distributed? If they are not normally distributed, then you can't use simple parametric tests of significant differences like t-tests. If you are comparing several groups of numbers, then an n-way ANOVA is more appropriate than performing n t-tests, because every additional t-test increases the chance of finding a "significant" difference that is not really there. These kinds of comparisons, of several groups of numbers, are very common in computational intelligence (the authors are comparing different algorithms over several data sets, or with different parameterisations), but I can't remember ever seeing a paper that mentioned ANOVA (if you can prove me wrong, please do so in the comments).
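To make the several-groups comparison concrete, here is a small sketch of one possible approach. Note that it is not the classical ANOVA: it computes the usual one-way F statistic but gets its p-value from a permutation test, which avoids the normality assumption and needs nothing beyond the Python standard library. In practice you would more likely reach for a statistics package (SciPy's `scipy.stats.f_oneway`, for instance). The function names and the accuracy values are invented for illustration.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_anova(groups, n_permutations=5000, seed=0):
    """Return (observed F, permutation p-value) for the given groups."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign values to groups
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical accuracies from eight repeated trials of three algorithms.
algorithm_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80]
algorithm_b = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82]
algorithm_c = [0.90, 0.88, 0.91, 0.89, 0.92, 0.87, 0.90, 0.91]

f_value, p_value = permutation_anova([algorithm_a, algorithm_b, algorithm_c])
print(f"F = {f_value:.2f}, permutation p = {p_value:.4f}")
```

A small p-value here only says that at least one group differs from the others; working out which one still needs a follow-up (post hoc) comparison with a correction for multiple testing.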

I call this kind of shallow statistical analysis Cargo Cult Statistics. The term is inspired by Richard Feynman's famous speech about Cargo Cult Science. In this case, it means that while it looks like the authors are doing a statistical analysis of their results (they are calculating the means and standard deviations), it isn't really so, because they are leaving out a huge amount of analysis that might actually tell them something useful about their results.

Now, I'm still learning about statistics (then again, I'm still learning about everything, and will be until the day I die). But at least I know to ask someone with a better knowledge of statistics than me for advice on how to analyse my results, and I think it makes my papers much better.