When we talk about science, “tradition” is surely not one of the first words that comes to mind. And yet researchers are not exempt from habits and customs that have little basis beyond history. The 95% significance level is the classic case.
Story of a disagreement.
95 is not a magic number that makes a measurement correct or not. So why do researchers around the world use it as the frontier of truth? The answer seems to be habit: a hundred-year-old custom that originated in a copyright dispute.
As Brent Goldfarb and Andrew King recount in a 2015 article, we have to go back to the 1920s. The protagonists of the story are two names familiar to anyone who has studied statistics: Karl Pearson and Ronald Fisher. Pearson had created reference tables for calculating p-values, for which he collected royalties from anyone who wanted to include them in a statistics book. Royalties that Fisher simply did not want to pay.
So he decided to create a method for assessing significance based on only two discrete values, the p-values 0.05 and 0.01. As Goldfarb and King write in their article, “a fair interpretation of this story is that we use p-values at least partially because a statistician of [the 1920s] was afraid that sharing his work would undermine his income.”
What exactly is the p-value?
P-values estimate the chance that we are making a specific type of error when estimating a statistical parameter, such as a mean. If the true value, the one for the whole population, is very far from the estimate we make from a limited sample of that population, the estimate is not very useful.
We can take the evaluation of a medical treatment as an example. We want to know whether the patients who received it are, on average, better off than those who did not. In this case we want to know whether the average improvement of the patients is greater than zero. If our estimate is positive but the true value is zero, the study is of no use to us (except, perhaps, to offer patients a useless drug).
The calculation. In this case, what the experimenter calculates is the probability of obtaining an estimate at least as large as the one observed if the true mean improvement were actually zero. We call this probability the p-value, and it is a key measure in science. Despite this, it is not always used.
The calculation of the p-value is based on statistical distributions. Every statistic constructed from a model has an associated probability distribution. It is usually bell-shaped: high in the center and falling away on both sides. From this distribution we can calculate the probability that the true value lies within a specific neighborhood of our estimate.
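As a rough illustration of this calculation, the sketch below computes a one-sided p-value for the treatment example using a bell-shaped (normal) approximation. The improvement scores are invented for illustration and the function name one_sided_p_value is hypothetical. With a sample this small, a real analysis would more likely use a t-test.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sided_p_value(improvements):
    """P-value for the hypothesis that the true mean improvement is zero.

    Uses a normal approximation to the distribution of the sample mean.
    """
    n = len(improvements)
    estimate = mean(improvements)
    standard_error = stdev(improvements) / sqrt(n)
    z = estimate / standard_error
    # Probability of a z-score at least this large if the true mean were zero.
    return 1.0 - NormalDist().cdf(z)


# Hypothetical improvement scores for ten treated patients (invented data).
sample = [1.2, -0.4, 0.8, 2.1, 0.3, -0.9, 1.5, 0.6, 1.1, 0.2]
p = one_sided_p_value(sample)
print(f"mean improvement = {mean(sample):.2f}, p-value = {p:.4f}")
```

The smaller the p-value, the harder it is to explain the observed improvement as sampling noise around a true value of zero.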
A significance of 95%.
Sometimes in addition to the p-value, and sometimes instead of it, discrete levels of significance are used: usually 95%, but sometimes also 99% and 99.9%. A measurement is then called “significant” when its p-value is below 0.05, 0.01, or 0.001 respectively. People also talk about being able to “reject, or not, the hypothesis” that the measurement is wrong to the point of being useless.
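The correspondence between p-values and discrete levels can be sketched in a few lines. The function name significance_label and the example p-values are hypothetical, chosen only to show how the thresholds behave.

```python
# Conventional thresholds, ordered from strictest to most lenient.
LEVELS = [(0.001, "99.9%"), (0.01, "99%"), (0.05, "95%")]


def significance_label(p_value):
    """Return the strictest conventional level that p_value satisfies."""
    for threshold, label in LEVELS:
        if p_value < threshold:
            return f"significant at the {label} level"
    return "not significant at the 95% level"


# Illustrative p-values (not taken from any real study).
for p in (0.0004, 0.008, 0.049, 0.051):
    print(p, "->", significance_label(p))
```

Note that 0.049 and 0.051 end up with opposite labels even though the evidence behind them is practically the same.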
Why this matters a century later.
Fisher himself would later acknowledge, Goldfarb and King go on to explain, that using p-values directly was better than his “binary method.” He went so far as to admit, they quote, that “no scientific worker has a fixed level of significance at which, from year to year and in all circumstances, he rejects [null] hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.” Context is important when interpreting statistics.
The “binary” idea of significance can give false confidence in results based on uncertainty. Absolute certainty does not exist, to the point that even the probability of error is itself an estimate.
Goldfarb and King’s article is not a treatise on the history of statistics, but rather a critique of the marketing literature. They estimate that between 24% and 40% of the studies they analyzed would not produce the same results if they were repeated. Being able to repeat an experiment is vital in science: it allows theories to be confirmed or rejected with greater certainty and erroneous studies to be ruled out. After all, a threshold of 0.05 implies, in principle, a 5% chance of declaring an effect that is not there.
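The 5% figure can be checked with a small simulation. The sketch below, a minimal example under stated assumptions, generates many hypothetical studies of a treatment whose true effect is exactly zero and counts how often each one comes out “significant” at 0.05. The study sizes, the seed, and the function name false_positive_rate are all arbitrary choices made for illustration, and the test used is a normal approximation rather than an exact one.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def false_positive_rate(n_studies=2000, n_patients=50, alpha=0.05, seed=0):
    """Share of simulated studies with p < alpha when the true effect is zero."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each patient's improvement is pure noise around a true mean of zero.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_patients)]
        z = mean(sample) / (stdev(sample) / sqrt(n_patients))
        p = 1.0 - NormalDist().cdf(z)  # one-sided p-value
        if p < alpha:
            hits += 1
    return hits / n_studies


rate = false_positive_rate()
print(f"false positive rate: {rate:.3f}")
```

The rate should land close to 0.05: roughly one in twenty studies of a useless treatment would still cross the threshold, which is why repeating experiments matters.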
A matter of bias. The biases that science faces are many and varied. The possibility of errors in a large number of scientific papers has been known and debated for more than a decade. The pressure to publish (“publish or perish”) and the reluctance of journals to publish confirmatory studies also give rise to another bias, publication bias, which is much feared by those who analyze the scientific literature.
The scientific method is the best tool we have today, but like any tool it is not perfect and needs the occasional fix. Shedding the burden of some traditions can be a good starting point.
George is Digismak’s reporter-cum-editor, with 13 years of experience in journalism.