More On Psychology Papers Overstating Evidence

Here’s a follow-up to my previous post on a new paper suggesting that evidence is often overstated in psychology papers. You can find the actual paper here. As stated in the news articles, the paper does not show that false results are being reported. Rather, it shows that the significance of measured effects tends to be systematically lower in replicated studies than in the originals. Furthermore, the significance of a replicated study also tends to increase with the significance of the original result, so it seems likely that the original papers may actually be getting the correct result but are overestimating its significance.

Suppose we make the most charitable assumption and assume that all the results are being done correctly and are being accurately reported. Can we then still come up with an explanation for why published results are generally more significant than reality? I think we can. Perhaps the most telling quote in the Times article is the one by Norbert Schwarz of the University of Southern California:

“There’s no doubt replication is important, but it’s often just an attack, a vigilante exercise.”

If his opinion is typical of researchers in psychology, then it means that much of the field is suspicious of studies attempting to replicate previous results. I would argue that a bias in favor of more significant results could easily be the result of the following two hypotheses:

  1. Replication is seen by many as unprofessional or rude
  2. Non-significant results are more difficult to publish, particularly in high-quality journals

The second point would be the principal cause of a bias, with the first preventing people from discovering problems.

I have heard that in most fields non-significant results are viewed as unpublishable or unimportant. This is not the case in particle physics and astrophysics, where there are well-developed quantitative models to test, but in most fields such models don’t exist and probably can’t. Assuming this is true, suppose that 20 research groups independently perform identical studies searching for some effect that does not exist. On average, about one of those groups will find a result with a p-value below 0.05, a common threshold for statistical significance. If all 20 groups submit their results to a journal, and only significant results are published, then the one or so study showing statistical significance is published and none of the others are.
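The 20-groups scenario can be sketched as a toy simulation. This is a hypothetical model of my own, using only the numbers from the text (20 groups, a 0.05 threshold); under the null hypothesis, each study’s p-value is uniform on [0, 1]:

```python
import random

random.seed(42)

N_GROUPS = 20       # independent groups running the same null study
ALPHA = 0.05        # conventional significance threshold
N_TRIALS = 10_000   # repeat the whole 20-group scenario many times

published = 0
for _ in range(N_TRIALS):
    # Under the null hypothesis, each study's p-value is uniform on [0, 1].
    p_values = [random.random() for _ in range(N_GROUPS)]
    # In this toy model, journals publish only the "significant" results.
    published += sum(p < ALPHA for p in p_values)

# Average number of published (false-positive) studies per round of 20.
print(published / N_TRIALS)  # averages about 1 per round
```

The point of the sketch is that roughly one spurious "discovery" per twenty null studies is exactly what the threshold guarantees; the publication filter then ensures that this one study is all the literature ever sees.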

The literature thus shows that a significant effect was found in one study but says nothing about the other 19 or so that found nothing. Because replication is seen as unprofessional, attempting a new study could harm researchers’ careers. Even if such a study were done, it might not be published, since many in the field view replication with suspicion.

So, the positive result remains the only result in the literature even though it is not significant when all the studies are considered together. The literature then represents a heavily biased sample of all studies being done, even though there is nothing inherently wrong with any individual study. When studies are replicated, we would expect significance to fall, because we are suddenly looking at a less biased sample of results than the journals publish.

There are also plenty of other ways to get distorted results from these two assumptions. If people know that only results reaching a certain level of significance are published, then there is a huge incentive to massage the data until it reaches that threshold. We end up with a system where everything is biased toward finding positive results, and the lack of replication creates a ratchet effect: once a positive result is found, results challenging it are discouraged.

Physics is somewhat protected from this sort of problem by a culture where null results are not just acceptable but actually preferred in many cases. In a field like dark matter, a non-null result is looked on with suspicion, while limit contours are the standard published result. This may cause a bias in the other direction, where positive results are discouraged, but it also means that the community won’t accept something new until there is overwhelming evidence in its favor.

Report: Psychology Papers Often Overstate Evidence

It hasn’t been a great year for social science, with several high-profile scandals involving faked or bad data. The New York Times has reported on a new article in Science looking into the reproducibility of 100 psychology articles. The paper concludes that 60 of these papers seemed to have issues. Fortunately, it did not uncover any fraudulent or outright false results, but it does seem like the papers were consistently overstating the evidence for their conclusions.

I haven’t had time to read the paper yet, but there are a few reasons why I think this could occur, so I might write a follow-up post or two about this. The Times article does mention that the studies replicating the 100 papers were required to have the input of the original authors to make sure that the studies were as similar as possible.

I think it is important to note that the physical sciences are not immune to pathological results. Sometimes bad results happen even if everything is done correctly. Statistics guarantees this: given enough perfectly designed and executed studies, some are bound to be wrong. Sometimes the authors simply forget to account for a systematic uncertainty, or maybe there is a problem with the theoretical model being tested. That is why we often see “3-sigma” effects in high energy physics disappear into noise. While “3-sigma” seemingly implies that there is a tiny chance of the result being wrong, there are still a lot of caveats. In something like the social sciences, the kind of precision found in much of physics is simply impossible to achieve, so the chance of getting a false result is likely to be much higher.
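To make the “3-sigma” language concrete, here is a short sketch (the function name is mine) converting a Gaussian sigma level into the standard two-sided tail probability, which is how these thresholds are usually quoted:

```python
import math

def two_sided_p(sigma: float) -> float:
    """Two-sided Gaussian tail probability for an n-sigma deviation."""
    return math.erfc(sigma / math.sqrt(2))

# 3 sigma corresponds to p of roughly 0.0027, i.e. about a 1-in-370
# chance of a fluctuation this large under the null hypothesis alone.
for n in (1, 2, 3, 5):
    print(f"{n} sigma -> p = {two_sided_p(n):.2e}")
```

Note that this number only accounts for statistical fluctuation; the caveats in the paragraph above (unmodeled systematics, a wrong theoretical model) are exactly what it leaves out, which is why 3-sigma effects can still evaporate.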

Hopefully someone will vet this paper to see if it, too, has bad methodology.

It’s National Left-Hander’s Day

I guess today was National Left-Hander’s Day for some reason. As someone who is left-handed, I don’t particularly care whether there is a day or not. At this point our primary plights are getting ink all over our hands due to the left-to-right writing scheme in most western countries and dealing with scissors that don’t work well when used left-handed.

Google Renames Itself

Google has announced that it is restructuring and renaming itself. What used to be known as Google (the company) will now be Alphabet. Alphabet will just be an umbrella for all the various pieces of what is currently Google. The Google name isn’t going anywhere, though. It sounds like a lot of the core pieces of the company, such as the Google search engine, will retain their original names.

Today in History

Today marks the 70th anniversary of the US dropping a nuclear bomb on the city of Nagasaki in southern Japan at the end of the Second World War. Japan surrendered only a few days later, with the development of nuclear weapons, the success of the US blockade of the Japanese home islands, and the entry of the USSR into the Asian theater making Japan’s position completely hopeless. The bombing of Nagasaki is also notable because it used a plutonium bomb rather than the uranium design used to attack Hiroshima. Plutonium bombs are reportedly considerably more difficult to design but need less than 10 kg of plutonium. Thus, the development of plutonium bombs like Fat Man helped start research into the miniaturization of nuclear weapons (i.e. packing as much explosive power into as small a package as possible). Fortunately for the rest of the world, this also marks the last time that a nuclear weapon was actually used during a war. Every other nuclear explosion has been for testing and research purposes.