Sir Ronald Aylmer Fisher (1890-1962) was one of the greatest statisticians of all time. However, Fisher was also stubborn, belligerent, and a eugenicist. When it comes to shocking remarks, one does not need to dig deep:
- In a dissenting opinion on the 1950 UNESCO report “The race question”, Fisher argued that “Available scientific knowledge provides a firm basis for believing that the groups of mankind differ in their innate capacity for intellectual and emotional development”.
- Fisher strongly, repeatedly, and persistently opposed the conclusion that smoking is a cause of lung cancer.
- Fisher felt that “The theory of inverse probability [i.e., Bayesian statistics] is founded upon an error, and must be wholly rejected.” (for details see Aldrich, 2008).
- In The Design of Experiments Fisher argued that “it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” (1935, p. 16). This confession should be shocking, because it means that we cannot quantify evidence for a scientific law. As Jeffreys (1961, p. 377) pointed out, in Fisher’s procedure the law (i.e., the null hypothesis) “is merely something set up like a coconut to stand until it is hit”.
The next section discusses another shocking statement, one that has been conveniently forgotten and flies in the face of current statistical practice.
The Lady Tasting Tea
Chapter 2 of The Design of Experiments is titled “The Principles of Experimentation Illustrated by a Psycho-Physical Experiment”. Here Fisher introduces the famous case of the lady tasting tea:
“A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested (…)
Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. (…)
Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.” (Fisher, 1935, p. 11)
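To make the design concrete, the chance distribution under the null hypothesis can be obtained by simple counting. The Python sketch below is an illustration, not Fisher's own presentation; the function names `null_distribution` and `p_value` are hypothetical. It assumes that the lady knows there are four cups of each kind and that, if she has no discriminatory ability, every division of the 8 cups into two sets of 4 is equally likely.

```python
from fractions import Fraction
from math import comb


def null_distribution(cups=8, milk_first=4):
    """Chance distribution of the number of milk-first cups picked correctly.

    Assumes pure guessing: each way of choosing `milk_first` cups out of
    `cups` is equally likely (a hypergeometric distribution).
    """
    total = comb(cups, milk_first)  # 70 possible divisions for 8 cups
    return {
        k: Fraction(
            comb(milk_first, k) * comb(cups - milk_first, milk_first - k),
            total,
        )
        for k in range(milk_first + 1)
    }


def p_value(correct, cups=8, milk_first=4):
    """One-sided p-value: probability of at least `correct` hits by chance."""
    dist = null_distribution(cups, milk_first)
    return sum(p for k, p in dist.items() if k >= correct)


dist = null_distribution()
for k, p in dist.items():
    print(f"{k} correct: {p} (about {float(p):.3f})")

print("p-value for a perfect division:", p_value(4))  # 1/70
```

Under these assumptions only a perfect division (probability 1/70, about .014) falls below the conventional .05 threshold; three or more correct has probability 17/70, about .24. Note that the calculation is possible only because the null hypothesis is exact: it fixes a single chance distribution.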
We have already seen above that a nonsignificant result (usually p>.05) cannot be used to quantify support in favor of the null hypothesis that the lady’s discriminatory ability is illusory. But what of a significant result (usually p<.05)? Surely, when we reject the null hypothesis we can now embrace the hypothesis that the lady does have discriminatory abilities? But Fisher emphatically denies this:
“It might be argued that if an experiment can disprove the hypothesis that the subject possesses no sensory discrimination between two different sorts of object, it must therefore be able to prove the opposite hypothesis, that she can make some such discrimination. But this last hypothesis, however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact. [italics ours] If it were asserted that the subject would never be wrong in her judgments we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation. It is evident that the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the “problem of distribution,” of which the test of significance is the solution.” (Fisher, 1935, p. 16)
Here we stand. It is common knowledge that a nonsignificant p-value cannot be used to support the null hypothesis (according to Fisher). What is not generally known is that, according to Fisher, a significant p-value does not warrant acceptance of the alternative hypothesis. In other words, the only legitimate inference is that p<.05 (say) undercuts the null hypothesis. This does NOT mean that the result favors the alternative hypothesis! Not only is this counterintuitive, we also believe that it violently conflicts with the way in which practitioners interpret their p-values. The purpose of most researchers is to make a positive claim (“there is evidence for the presence of X”); we speculate that most researchers believe that such claims can be made from significant p-values, that is, “p<.05, there is evidence against the absence of X” will quickly be interpreted as “p<.05, there is evidence for the presence of X”.
Shocking.
References
Aldrich, J. (2008). R. A. Fisher on Bayes and Bayes’ theorem. Bayesian Analysis, 3, 161-170.
Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver & Boyd.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Oxford University Press.
About The Author
Eric-Jan Wagenmakers
Eric-Jan (EJ) Wagenmakers is a professor at the Psychological Methods Group at the University of Amsterdam.
Johnny van Doorn
Johnny van Doorn is a PhD candidate at the Psychological Methods department of the University of Amsterdam.