More On Psychology Papers Overstating Evidence

Here’s a followup to my previous post on a new paper suggesting that evidence is often overstated in psychology papers. You can find the actual paper here. As stated in the news articles, the paper does not show that false results are being reported. Rather, it shows that the significance of measured effects seems to be systematically lower in replicated studies than in the originals. Furthermore, the significance of replicated studies also increases with the significance of the original result, so it seems likely that the original papers may actually be getting the correct result but are just overestimating the significance.

Suppose we make the most charitable assumption: that all the studies are being done correctly and are being accurately reported. Can we then still come up with an explanation for why published results are generally more significant than reality? I think we can. Perhaps the most telling quote in the Times article is the one by Norbert Schwarz of the University of Southern California:

“There’s no doubt replication is important, but it’s often just an attack, a vigilante exercise.”

If his opinion is typical of researchers in psychology, then it means that much of the field is suspicious of studies attempting to replicate previous results. I would argue that a bias in favor of more significant results could easily be the result of the following two hypotheses:

  1. Replication is seen by many as unprofessional or rude
  2. Non-significant results are more difficult to publish, particularly in high-quality journals

The second point would be the principal cause of a bias, with the first preventing people from discovering problems.

I have heard that in most fields non-significant results are often viewed as unpublishable or unimportant. This is actually not the case in particle physics and astrophysics, where there are well-developed quantitative models that we can test, but in most fields such models don’t really exist and probably can’t. But, assuming that this is true, let’s suppose that 20 research groups independently perform identical studies searching for some effect that does not exist. On average, around one of those groups will find a result with a p-value less than 0.05, which is a common threshold for statistical significance. If all 20 of those groups submit their results to a journal, and if only significant results are published, then the one or so studies showing statistical significance are published and none of the others are.
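To make the arithmetic concrete, here is a minimal sketch in Python. It assumes that when no effect exists each group's p-value is uniformly distributed between 0 and 1 (true for a well-behaved test with a continuous statistic) and that the groups are independent. The function names and the number of simulated trials are my own choices for illustration.

```python
import random

N_GROUPS = 20   # independent groups running the same study
ALPHA = 0.05    # significance threshold

# Analytic values: expected number of false positives, and the
# probability that at least one group crosses the threshold.
expected_false_positives = N_GROUPS * ALPHA
prob_at_least_one = 1 - (1 - ALPHA) ** N_GROUPS

def count_significant(rng, n_groups=N_GROUPS, alpha=ALPHA):
    """Count groups with p < alpha, assuming p is uniform under the null."""
    return sum(rng.random() < alpha for _ in range(n_groups))

def simulate(n_trials=20_000, seed=0):
    """Repeat the 20-group scenario many times.

    Returns (average number of significant studies per scenario,
             fraction of scenarios with at least one significant study).
    """
    rng = random.Random(seed)
    counts = [count_significant(rng) for _ in range(n_trials)]
    mean_count = sum(counts) / n_trials
    frac_any = sum(c >= 1 for c in counts) / n_trials
    return mean_count, frac_any

if __name__ == "__main__":
    mean_count, frac_any = simulate()
    print(f"expected false positives (analytic): {expected_false_positives:.2f}")
    print(f"P(at least one) (analytic):          {prob_at_least_one:.3f}")
    print(f"simulated mean count:                {mean_count:.2f}")
    print(f"simulated P(at least one):           {frac_any:.3f}")
```

Under these assumptions the expected count is one, and the chance that at least one of the 20 groups gets a publishable "effect" is 1 - 0.95^20, or roughly 64%.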

The literature thus shows that a significant effect was found in one study but says nothing about the other 19 or so that found nothing. Replication is seen as unprofessional, so it would potentially harm researchers’ careers to try to publish a new study. Even if such a study were done, it might not get published because many in the field view replication with suspicion.

So, the positive result remains the only result in the literature even though the effect is not significant if all studies are considered. The literature then represents a heavily biased sample of all studies being done even though there is not anything inherently wrong with any individual study. When studies are replicated, we would then expect significance to fall because we are suddenly considering a less biased sample of results than the journals do.
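The drop in significance on replication can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions of my own: each study's result is summarized by a z-score drawn from a normal distribution with unit width around a true value `true_z`, a study is "published" only if |z| > 1.96 (two-sided p < 0.05), and every published study is replicated once with no filter applied. The function name and parameters are hypothetical.

```python
import random

def replicate_published(true_z=0.0, n_studies=50_000, z_crit=1.96, seed=1):
    """Compare published original results with unfiltered replications.

    Returns (mean |z| of published originals,
             mean |z| of their replications,
             fraction of replications that are themselves significant).
    """
    rng = random.Random(seed)
    originals, replications = [], []
    for _ in range(n_studies):
        z_orig = rng.gauss(true_z, 1.0)
        if abs(z_orig) > z_crit:
            # Only significant originals make it into the literature.
            originals.append(abs(z_orig))
            # The replication is an independent draw with no publication filter.
            replications.append(abs(rng.gauss(true_z, 1.0)))
    mean_orig = sum(originals) / len(originals)
    mean_rep = sum(replications) / len(replications)
    frac_rep_sig = sum(z > z_crit for z in replications) / len(replications)
    return mean_orig, mean_rep, frac_rep_sig

if __name__ == "__main__":
    # true_z = 0.0 is the no-effect case; true_z = 1.0 is a weak real effect.
    for true_z in (0.0, 1.0):
        mean_orig, mean_rep, frac_sig = replicate_published(true_z=true_z)
        print(f"true_z={true_z}: published mean |z| = {mean_orig:.2f}, "
              f"replication mean |z| = {mean_rep:.2f}, "
              f"replications significant = {frac_sig:.1%}")
```

In both cases the published originals look more significant than their replications, even though every simulated study is done honestly. The only source of bias in this toy model is the publication filter.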

There are also plenty of other ways to get distorted results based on these two assumptions. If people know that only results reaching a certain level of significance are published, then there is a huge incentive to somehow massage the data to reach that threshold. We end up with a system where everything is biased toward finding positive results, and the lack of replication creates a ratchet effect where, once a positive result is found, other results challenging it are discouraged.

Physics is somewhat protected from this sort of problem because of a culture where null results are not just acceptable, but are actually preferred in many cases. In a field like dark matter, a non-null result is looked on with suspicion, while limit contours are the standard way of presenting results. This may cause a bias in the other direction, where positive results are discouraged, but it also means that the community won’t accept something new until there is overwhelming evidence in favor of it.