This is the arresting title of a 2005 paper by a Harvard medic, John Ioannidis. It is especially arresting for a research scientist like me. And he really does mean it, although I should say that he is talking mainly about medical research, not physics.
Part of his argument relates to the tendency of biomedical scientists (Ioannidis’ field) to use what are called tests of statistical significance. The argument commonly used in many biomedical papers runs essentially as follows: “We were interested in the effect of X on Y, so we made 10 measurements of Y, changed X and then made another 10 measurements of Y. The two sets of 10 measurements of Y were significantly different, so X affects Y.” By ‘significantly different’ they mean that a statistical quantity called a p value is small (less than 0.05, i.e. 1 in 20).
The point is that, especially when you are measuring messy things like living organisms (e.g., us), all 20 measurements of Y have some noise in them. So if you do 10 measurements, and then another 10, this noise means you will always get somewhat different results for the two sets. The trick is to distinguish differences due to random noise from differences due to the change in X.
Often this is done by estimating the probability that a difference as large as the one seen between the two sets of 10 measurements would arise just by chance, even if changing X had no effect. This is what the p value is. Then if p is, say, 1 in 100, the scientists conclude: “The p value is 0.01, so there is only a one in a hundred chance that our observations could arise just by chance. So the effect of changing X must be real.”
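Ioannidis’ point is about the logic, not any particular piece of software, but as a rough illustration here is a minimal Python sketch (using NumPy and SciPy, with made-up numbers and variable names of my own) of the kind of test these papers rely on. Note that in this sketch changing X genuinely does nothing: both sets of 10 measurements come from the same distribution, so any difference is pure noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two sets of 10 noisy measurements of Y. Both are drawn from the *same*
# distribution, i.e. changing X has no real effect; any difference between
# the two sample means is pure noise.
y_before = rng.normal(loc=5.0, scale=1.0, size=10)
y_after = rng.normal(loc=5.0, scale=1.0, size=10)

# A two-sample t-test asks: if there were no real effect, how probable is a
# difference in means at least this large, just from the noise?
t_stat, p_value = stats.ttest_ind(y_before, y_after)

print(f"mean before: {y_before.mean():.2f}, mean after: {y_after.mean():.2f}")
print(f"p value: {p_value:.3f}")
# The usual convention: if p < 0.05 the difference is declared
# 'statistically significant' and the effect of X is reported as real.
```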
However, there are problems with this reasoning. One is highlighted by Ioannidis in his paper. It is that most important questions have a number of teams of scientists working on them. So say 10 teams are working on understanding Y, and each of them is trying several things to work out what causes Y. If each team is trying 10 different candidates for X (X1, X2, …, X10), then between them the 10 teams are doing 100 experiments to find a possible X that causes Y.
If none of the candidate X’s has any effect, then in those 100 experiments, just by chance, there should be about one pair of before-and-after sets of Y measurements that differ so much that such a difference would arise only one time in a hundred. If the 99 failed attempts are then ignored, which is easy to do, especially as they occurred in 10 different labs, and just this one ‘success’ is written up, then a false result, together with its misleading statistical test, gets published.
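To make the arithmetic concrete, here is a small simulation (again just an illustrative sketch of my own, not anything from Ioannidis’ paper) of 100 such experiments in which none of the candidate X’s does anything at all:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments = 100   # 10 teams x 10 candidate X's, none with any real effect
n_false_positives = 0

for _ in range(n_experiments):
    # Both samples come from the same distribution: this X truly does nothing.
    y_before = rng.normal(loc=5.0, scale=1.0, size=10)
    y_after = rng.normal(loc=5.0, scale=1.0, size=10)
    _, p_value = stats.ttest_ind(y_before, y_after)
    if p_value < 0.01:
        n_false_positives += 1

print(f"{n_false_positives} of {n_experiments} null experiments gave p < 0.01")
# On average about 1 in 100 such experiments passes the 1-in-100 threshold by
# chance alone. If only that one 'success' is written up, the literature ends
# up containing a false positive dressed in a convincing-looking p value.
```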
This is a serious problem in biomedical research. As Ioannidis notes in his paper, attempts to replicate some high-profile results have failed, presumably because the results are wrong. Scientists, and the field as a whole, need to get better at estimating their errors.