Sometime back in February, I had a conversation with a colleague on Twitter about the conditions that make a scientific result credible. He argued that one needs to replicate a scientific study's results in order to trust them, while I felt that a scientific result must be backed by a robust theory to be trusted.¹

A few days ago, Berna Devezer, a specialist in these meta-scientific issues, posted a summary of what she and her team have learned from working on replication: why can reproducibility not serve as a necessary and sufficient indicator of the quality of scientific research, especially in the humanities and social sciences, such as psychology?

https://twitter.com/zerdeve/status/1688251046190333952

In my opinion, replicability is an important criterion for assessing the quality of scientific research, but it can't be the sole and definitive measure. It helps verify the legitimacy of a study's results, ensuring they are not due to chance, measurement error, publication bias, or manipulation. However, replication is not always feasible, adequate, or even desirable for assessing the quality of scientific research.

Firstly, you can't always replicate a study. Replication cannot be a fundamental, universal standard independent of the specifics of a scientific field. In historical, archaeological, or human-science research, recreating the exact conditions of observation or experimentation from the original study can be difficult or even impossible.

The attempt by Doyen and his team (2012) to replicate the behavioral-priming experiment of Bargh, Chen, & Burrows (1996) is an apt example. According to Bargh's theory, activating a stereotype should increase the probability that participants unconsciously and automatically behave in line with that stereotype. In the study testing this theory, activating the elderly stereotype (associated with slowness) should cause students to walk more slowly than in a condition where the stereotype is not activated. The prediction held in two experiments: participants in the "elderly" condition took longer to cover the 9.75 m from the experiment room to a floor marker than the control group. However, to replicate this effect, a series of auxiliary assumptions must also be satisfied; these are not part of the theory itself, but they dictate the conditions under which it applies. Does the elderly stereotype among French-speaking students in Belgium in 2010 match that of English-speaking students in New York City in the 1990s? Does the French translation of the adjectives used by Bargh et al. (1996) carry the same connotations as the English originals? Violating one of these auxiliary hypotheses can lead to the effect not being replicated, without refuting or invalidating the underlying theory. As Van Bavel et al. (2016) show, even after adjusting for other methodological variables, contextual factors remain associated with replicability: conducting a replication at a different time, in a different place, or with a different sample can change what would otherwise count as a "direct replication."

Moreover, beyond these auxiliary hypotheses, which researchers often overlook, other factors, such as the quality of the operationalization and of the measurement instruments, can make some studies hard to replicate without necessarily calling the underlying theories into question or falsifying them. So, if replication isn't always possible, it can't be considered a necessary condition, at least in the social sciences.

Secondly, replication isn't sufficient. Replication alone doesn't guarantee that results are true, relevant, or interesting, and replicating a study doesn't make it theoretically, methodologically, or ethically sound. An effect can even be replicated without any genuine theory behind it. An example often cited in epistemology (Chalmers, 1987) and taken up by Trafimow and Earp (2016) is the phlogiston theory. Before Lavoisier, combustion was attributed to an element named "phlogiston," even though the relation was never precisely articulated. Despite the theory's vagueness, researchers managed to demonstrate and replicate the existence of oxygen, nitrogen, and other major elements. Eventually, Lavoisier refuted the phlogiston theory on the basis of increasingly precise measurements (some objects, for instance, gained weight after supposedly losing phlogiston) and proposed a better theory. In that sense, replication suffers from the same problem of induction that Popper identified: just as observing a large number of black crows doesn't "confirm" my theory that all crows are black, an effect being replicated many times doesn't necessarily mean the underlying theory is true.

Lastly, replicability isn't always desirable. Devezer raises the risk that overemphasizing replication induces uniformity in science and erodes epistemological diversity. If replicability becomes the primary criterion for judging the quality of scientific research, it will likely favor quantitative and experimental approaches and sideline more qualitative and ecological ones. If replication becomes a prerequisite for a theory's credibility or validity, much energy and effort may be devoted to exactly reproducing past effects at the expense of generating new ideas and theories, risking the degeneration of our research programs in Lakatos's sense. Moreover, a failure to replicate an effect may lead researchers to abandon a line of research entirely, rather than investigating the epistemological, theoretical, or methodological reasons behind the failure.

In summary, while replicability is a valuable criterion, it is neither necessary nor sufficient for assessing the quality of scientific research. For a theory to be credible, the studies supporting it must accumulate enough diverse, converging evidence consistent with it. The researcher's job is not to perpetually confirm hypotheses or theories, but rather to try to falsify and refute them in order to drive progress. Unfortunately, for the reasons mentioned above, a replication failure is not a refutation.

In closing, I'd like to revisit the last point Devezer brings up. I tend to agree with her about the impact that the incentives researchers face have on the advancement of science. While science is a long-term endeavor, the career-driven pressure to publish and be cited may steer researchers toward "easy," "innovative" research focused on discovering new (and replicable?) effects, rather than toward developing theories that help us understand the world around us. After all, understanding the world takes time.

References:
Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of Social Behavior: Direct Effects of Trait Construct and Stereotype Activation on Action. Journal of Personality and Social Psychology, 71(2), 230–244. https://doi.org/10.1037/0022-3514.71.2.230

Doyen, S., Klein, O., Pichon, C.-L., & Cleeremans, A. (2012). Behavioral Priming: It’s All in the Mind, but Whose Mind? PLoS ONE, 7(1), e29081. https://doi.org/10.1371/journal.pone.0029081

Trafimow, D., & Earp, B. D. (2016). Badly specified theories are not responsible for the replication crisis in social psychology: Comment on Klein. Theory & Psychology, 26(4), 540–548. https://doi.org/10.1177/0959354316637136

Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences, 113(23), 6454–6459. https://doi.org/10.1073/pnas.1521897113

1. I'll likely revisit this point later in a blog post.