About a week ago, the Times published an op-ed responding to that recent Science paper claiming that many psychology papers overstate their evidence. The op-ed was written by Lisa Barrett, a psychology professor at Northeastern. Her main argument is that failure to replicate results is one of the ways that science discovers new things, so psychology papers failing to replicate is not a problem.
While I think there is some truth there, I am not convinced by this argument. In fact, I find the counterexample of subatomic physics failing to replicate Newton’s laws to be incredibly disingenuous. The point that the psychology paper was making is that attempts to replicate a result by performing what is more or less the exact same study failed to obtain the same result far more often than would be expected from the experimental uncertainties. One of the nice things about physics is that laboratory conditions can be controlled well enough to actually perform (almost) identical experiments in (almost) identical conditions. Newton’s laws can be verified pretty easily by almost anyone. You can set up a system of springs and pulleys and measure oscillation frequencies, or see how much weight it takes to lift some object attached to a string. There will be small deviations from air resistance, friction, etc., but in the end you will still verify Newton’s laws up to uncertainties from some reasonably well-known experimental conditions.
Basic classical physics doesn’t break down when you try to replicate an experiment. Rather, it breaks down when you try to extrapolate physical laws a bit too far. Classical mechanics describes macroscopic objects traveling at speeds much lower than the speed of light. Deviations do exist but are almost always orders of magnitude too small to measure in a realistic experiment. Subatomic physics typically violates both of these assumptions, (1) macroscopic scale and (2) low velocity, so the equations governing our everyday life stop working. Furthermore, the op-ed’s example conflates two different things: testing a scientific theory against experiment, and comparing one experimental result to another. If the experiments are truly equivalent, they should obtain the same result regardless of what any theory says. For some measurements there doesn’t even need to be much of a theory at all.
Unfortunately, studies in the life sciences and social sciences are often much more difficult to replicate. People and living things aren’t particles; they’re not all identical, interchangeable quanta. Selection effects are going to be a very difficult problem to avoid or control, so I think it is natural not to require the same kind of precision that we can get in something like physics. It may even be true that many of the results in the paper failed to replicate because the populations in the original and new studies were not equivalent. Maybe more care needs to be taken when choosing test subjects in the replication studies, but this is still a problem unless someone can identify exactly how to do this. The whole point of scientific experiments is to generate data that can be used to build and test broader scientific theories. If scientists perform a number of seemingly identical studies and get an equal number of incompatible results, then there isn’t really any way to inform theories other than to say that we don’t yet understand the experimental data.
Even the example of giving a rat a shock suggests that the author has a fundamental misunderstanding of what it means to replicate a study. She mentions that different results are obtained depending on the exact experimental procedure. This is not surprising. If you change an experiment, different things can happen, because you are no longer replicating the original experiment. You are instead testing whether your earlier result still applies in different circumstances. This case is also probably more akin to what happens in psychology studies than any example in physics. There will always be slight differences between studies using human subjects, and if those variations aren’t understood, then any result can be rendered almost meaningless. (As far as I know, the people studying rats knew how to control their experimental conditions, so this wouldn’t be a problem for their measurements.)
As I said at the beginning, there still are some worthwhile points here. Sometimes a failure to replicate really is a sign that something interesting could be happening. In neutrino physics, there is a series of “anomalies” in which different experiments do not match one another. In dark matter detection, there are conflicting measurements of possible dark matter signals and dark matter exclusion curves. There have even been a number of high-profile blunders in physics. These things happen. New measurements are being done to try to resolve these conflicts, whether they turn out to be experimental errors or real effects. The big problems that seem to have been identified in psychology are that (1) these new measurements often aren’t being done and (2) measurements are consistently falling on the side of having stronger statistical significance than they should. Point (1) means that errors can’t be easily identified, and point (2) suggests that published results are biased and consistently underestimate systematic uncertainties.
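To make point (2) a bit more concrete, here is a toy simulation of my own (it is not taken from the Science paper or the op-ed). It assumes a small true effect, noisy studies of a fixed size, and a journal that only publishes results passing a significance cut. The effect size, sample size, threshold, and the `simulate` function itself are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def simulate(true_effect=0.2, n=30, studies=20000, z_crit=1.96, seed=0):
    """Toy model of publication bias (all parameters are assumed values).

    Each study measures the mean of n unit-variance observations, so its
    result is drawn from a Gaussian centered on the true effect with
    standard error 1/sqrt(n). Only "significant" originals get published,
    and every published study is then replicated once at the same size.
    """
    rng = random.Random(seed)
    se = 1.0 / math.sqrt(n)  # standard error of each study's measured effect
    published = []
    replicated = 0
    for _ in range(studies):
        original = rng.gauss(true_effect, se)
        if original / se <= z_crit:
            continue  # not significant, so it never gets published
        published.append(original)
        # An honest, identical replication of the published study
        replication = rng.gauss(true_effect, se)
        if replication / se > z_crit:
            replicated += 1
    return {
        "published_fraction": len(published) / studies,
        "mean_published_effect": statistics.fmean(published),
        "replication_rate": replicated / len(published),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the published studies report an average effect well above the true value of 0.2, and only a minority of them replicate, even though nothing is wrong with any individual experiment. The selection step alone is enough to produce results that look more significant than they should.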