Science Caught in Reproducibility Crisis

Written by Monya Baker

A semantic confusion is clouding one of the most talked-about issues in research. Scientists agree that there is a crisis in reproducibility, but they can’t agree on what ‘reproducibility’ means.

The muddle is hampering communication about the problem and efforts to address it, a meeting last week on improving the reproducibility of preclinical research was told.

Most scientists, at least those in biomedical research, hold that reproducible findings are those that give generally consistent results across slight variations in experimental set-up, says Ferric Fang, a microbiologist at the University of Washington in Seattle. “Reproduction is taking the idea of a scientific project and showing that it is robust enough to survive various sorts of analysis,” he says. On this view, a reproducible finding supports an expectation: for example, that ‘reproducible’ preclinical results are those worth taking forward to clinical trials.

But Fang adds that other definitions of the term are in common use. A second one is much narrower: a finding is reproducible if another researcher gets the same results when doing exactly the same experiment. On this interpretation, a fragile experiment that works under certain conditions in the laboratory, but not in other contexts, is still ‘reproducible’. And a third definition holds that a reproducible experiment is merely one that has been published with a sufficiently complete description — such as detailed methods — for another scientist to repeat it.

All these definitions point to the various problems that plague research. Scientists don’t want experiments that are poorly documented or unreliable, or that don’t give similar findings when the methods are slightly tweaked. But all of these issues have at times been framed as issues of reproducibility. “Reproducibility is shorthand for a lot of problems,” Jon Lorsch, head of the National Institute of General Medical Sciences, told attendees at the meeting, which was held in Bethesda, Maryland.

Without a shared understanding of the term, it can be unclear how scientists should respond when told that someone is “unable to reproduce” results in their paper, adds Ulrich Dirnagl, a stroke researcher at the Charité Medical University in Berlin. Challenges to research should be more clearly explained, he says.

Expanded terms

Instead of advocating for a common definition, several scientific leaders are calling for an expanded set of terms. Earlier this month, researchers at the Meta-Research Innovation Center at Stanford in California proposed three terms [1]: methods reproducibility, results reproducibility and inferential reproducibility, mapping roughly onto the three concepts described by Fang.

But the term can be split according to other kinds of distinctions. Victoria Stodden, a data scientist at the University of Illinois at Urbana-Champaign, makes the distinction between ‘empirical’ reproducibility (supplying all the details necessary for someone to physically repeat and verify an experiment) and ‘computational’ and ‘statistical’ reproducibility, which refer to the resources needed to redo computational and analytical findings. Achieving each type of reproducibility calls for different remedies, she says.

Last year, a white paper by the American Society for Cell Biology in Bethesda dismissed reproducibility as a catch-all term, and introduced a four-tier definition instead. According to this paper, “analytic replication” refers to attempts to reproduce results by reanalysing original data; “direct replication” refers to efforts to use the same conditions, materials and methods as an original experiment; “systematic replication” describes efforts to produce the same findings using different experimental conditions (such as trying an experiment in a different cell line or mouse strain); and “conceptual replication” refers to attempts to demonstrate the general validity of a concept, perhaps even using different organisms.

Agreeing on ways to assess reproducibility can be fraught with complications, even within a research field. Last year, the Reproducibility Project: Psychology, which conducted 100 replication studies to assess psychology publications, used five indicators of whether a study had been successfully replicated [2]. Earlier this year, a paper that aimed to distil best practices for neuroimaging outlined 10 levels of reproducibility in such experiments across three categories, called ‘measurement stability’, ‘analytical stability’ and ‘generalizability’. The levels differed according to whether, for example, an attempted repeat experiment used the same scanners, analysis methods, subject population, stimulus type and so on across 11 variables.

Ultimately, each discipline may need to come to its own definition, says Lee Ellis, a surgical oncologist at the University of Texas MD Anderson Cancer Center in Houston. “I don’t think we can define reproducibility across the board,” he says.

Arguments over such distinctions are not just fruitless wordplay, says Dirnagl. An appreciation of the nuances of reproducibility could help researchers to communicate when they can’t reach common ground on apparently differing findings. Some of science’s most enlightening results arise when an effect is partially reproducible — seen under some conditions but not others. “Science advances through differential reproduction,” he says.

Comments (1)

    Jerry L Krause


    Hi Monya,

    In his commencement address at Caltech in 1974, which was published in his book—“Surely You’re Joking Mr. Feynman!”—Feynman told this story and gave a warning: “Other kinds of errors are more characteristic of poor science. When I was at Cornell, I often talked to the people in the psychology department. One of the students told me she wanted to do an experiment that went something like this—it had been found by others that under certain circumstances, X, rats did something, A. She was curious as to whether if she changed the circumstances to Y, they would still do A. So her proposal was to do the experiment under circumstances Y and see if they still did A.

    “I explained to her that it was necessary first to repeat in her laboratory the experiment of the other person—to do it under condition X to see if she could also get result A, and then change to Y and see if A changed. Then she would know that the real difference was the thing she thought she had under control.

    “She was delighted with this new idea, and went to her professor. And his reply was, no you cannot do that, because the experiment had already been done and you would be wasting time. This was about 1947 or so, and it seems to have been the general policy then to not try to repeat psychological experiments, but only to change the conditions and see what happens.

    “Nowadays there’s a certain danger of the same thing happening, even in the famous field of physics. I was shocked to hear of an experiment done at the big accelerator at the National Accelerator Laboratory, where a person used deuterium. In order to compare his heavy hydrogen results to what might happen with light hydrogen, he had to use data from someone else’s experiment on light hydrogen, which was done on different apparatus. When asked why, he said it was because he couldn’t get time on the program (because there’s so little time and it’s such expensive apparatus) to do the experiment with light hydrogen with this apparatus because there wouldn’t be any new result. And so the men in charge of programs at NAL are so anxious for new results, in order to get more money to keep the thing going for public relations purposes, they are destroying—possibly—the value of the experiments themselves, which is the whole purpose of the thing. It is often hard for the experimenters there to complete their work as their scientific integrity demands.”

    Have a good day, Jerry
