Predicting the replicability of social science lab experiments

Authors: Adam Altmejd (1); Anna Dreber (1); Eskil Forsell (1); Juergen Huber (3); Taisuke Imai (4); Magnus Johannesson (1); Michael Kirchler (3); Gideon Nave (5); Colin Camerer (6)
Author affiliations: (1) Department of Economics, Stockholm School of Economics, Stockholm, Sweden; (2) SOFI, Stockholm University, Stockholm, Sweden; (3) Universität Innsbruck, Innsbruck, Austria; (4) LMU Munich, Munich, Germany; (5) The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America; (6) California Institute of Technology, Pasadena, California, United States of America
Published in: PLoS ONE 14(12)
Category: Research Article


We measure how accurately replication of experimental results can be predicted by black-box statistical models. With data from four large-scale replication projects in experimental psychology and economics, and techniques from machine learning, we train predictive models and study which variables drive predictable replication. The models predict binary replication with a cross-validated accuracy rate of 70% (AUC of 0.77) and estimates of relative effect sizes with a Spearman ρ of 0.38. The accuracy level is similar to market-aggregated beliefs of peer scientists [1, 2]. The predictive power is validated in a pre-registered out-of-sample test of the outcome of [3], where 71% (AUC of 0.73) of replications are predicted correctly and effect size correlations amount to ρ = 0.25. Basic features such as the sample and effect sizes in original papers, and whether reported effects are single-variable main effects or two-variable interactions, are predictive of successful replication. The models presented in this paper are simple tools to produce cheap, prognostic replicability metrics. These models could be useful in institutionalizing the process of evaluation of new findings and guiding resources to those direct replications that are likely to be most informative.
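As a reading aid for the three metrics quoted above (accuracy, AUC, and Spearman ρ), the following is a minimal sketch in plain Python. It is not the authors' code or data: the function names are hypothetical, the 0.5 classification threshold is an assumption, and the toy numbers are invented purely for illustration.

```python
from itertools import product


def accuracy(y_true, p_hat, threshold=0.5):
    """Share of studies whose binary outcome is predicted correctly.

    The 0.5 cut-off is an assumption made for this sketch.
    """
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_hat))
    return hits / len(y_true)


def auc(y_true, p_hat):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen replicated study gets a higher score than a randomly chosen
    non-replicated one (ties count as one half)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a, b in product(pos, neg))
    return wins / (len(pos) * len(neg))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Invented toy data: 1 = replicated, 0 = did not replicate.
replicated = [1, 0, 1, 1, 0, 0]
predicted_prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

# Invented relative effect sizes (replication effect / original effect).
observed_rel_effect = [0.8, 0.1, 0.6, 0.5, 0.0, 0.2]
predicted_rel_effect = [0.7, 0.3, 0.5, 0.2, 0.1, 0.6]

print("accuracy:", accuracy(replicated, predicted_prob))
print("AUC:", auc(replicated, predicted_prob))
print("Spearman rho:", spearman_rho(observed_rel_effect, predicted_rel_effect))
```

Accuracy depends on the chosen threshold, whereas AUC and Spearman ρ depend only on how the predictions are ordered, which is one reason a report might quote all three.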

Keywords:

Algorithms – Experimental economics – Machine learning – Machine learning algorithms – Replication studies – Scientists – Experimental psychology


1. Dreber A, Pfeiffer T, Almenberg J, Isaksson S, Wilson B, Chen Y, et al. Using Prediction Markets to Estimate the Reproducibility of Scientific Research. Proceedings of the National Academy of Sciences. 2015;112(50):15343–15347. doi: 10.1073/pnas.1516179112

2. Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, et al. Evaluating Replicability of Laboratory Experiments in Economics. Science. 2016;351(6280):1433–1436. doi: 10.1126/science.aaf0918 26940865

3. Camerer CF, Dreber A, Holzmeister F, Ho TH, Huber J, Johannesson M, et al. Evaluating the Replicability of Social Science Experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour. 2018;2(9):637–644. doi: 10.1038/s41562-018-0399-z 31346273

4. Simonsohn U, Nelson LD, Simmons JP. P-Curve: A Key to the File-Drawer. Journal of Experimental Psychology: General. 2014;143(2):534–547. doi: 10.1037/a0033242

5. Simmons JP, Nelson LD, Simonsohn U. False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science. 2011;22(11):1359–1366. doi: 10.1177/0956797611417632 22006061

6. Koch C, Jones A. Big Science, Team Science, and Open Science for Neuroscience. Neuron. 2016;92(3):612–616. doi: 10.1016/j.neuron.2016.10.019 27810003

7. Open Science Collaboration. Estimating the Reproducibility of Psychological Science. Science. 2015;349(6251).

8. Bavel JJV, Mende-Siedlecki P, Brady WJ, Reinero DA. Contextual Sensitivity in Scientific Reproducibility. Proceedings of the National Academy of Sciences. 2016;113(23):6454–6459. doi: 10.1073/pnas.1521897113

9. Ioannidis JPA. Why Most Published Research Findings Are False. PLOS Medicine. 2005;2(8):e124. doi: 10.1371/journal.pmed.0020124 16060722

10. Lindsay DS. Replication in Psychological Science. Psychological Science. 2015;26(12):1827–1832. doi: 10.1177/0956797615616374 26553013

11. Ioannidis JPA, Munafò MR, Fusar-Poli P, Nosek BA, David SP. Publication and Other Reporting Biases in Cognitive Sciences: Detection, Prevalence, and Prevention. Trends in Cognitive Sciences. 2014;18(5):235–241. doi: 10.1016/j.tics.2014.02.010 24656991

12. Nosek BA, Alter G, Banks GC, Borsboom D, Bowman SD, Breckler SJ, et al. Promoting an Open Research Culture. Science. 2015;348(6242):1422–1425. doi: 10.1126/science.aab2374 26113702

13. Ioannidis JP, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG. Replication Validity of Genetic Association Studies. Nature Genetics. 2001;29(3):306–309. doi: 10.1038/ng749 11600885

14. Martinson BC, Anderson MS, de Vries R. Scientists Behaving Badly. Nature. 2005;435:737–738. doi: 10.1038/435737a 15944677

15. Silberzahn R, Uhlmann EL, Martin DP, Anselmi P, Aust F, Awtrey E, et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Advances in Methods and Practices in Psychological Science. 2018;1(3):337–356. doi: 10.1177/2515245917747646

16. De Vries R, Anderson MS, Martinson BC. Normal Misbehavior: Scientists Talk about the Ethics of Research. Journal of Empirical Research on Human Research Ethics. 2006;1(1):43–50. doi: 10.1525/jer.2006.1.1.43 16810336

17. Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, Percie du Sert N, et al. A Manifesto for Reproducible Science. Nature Human Behaviour. 2017;1(1):0021. doi: 10.1038/s41562-016-0021

18. O’Boyle EH, Banks GC, Gonzalez-Mulé E. The Chrysalis Effect: How Ugly Initial Results Metamorphosize Into Beautiful Articles. Journal of Management. 2017;43(2):376–399.

19. Begley CG, Ioannidis JPA. Reproducibility in Science. Circulation Research. 2015;116(1):116–126.

20. Ioannidis JPA, Tarone R, McLaughlin JK. The False-Positive to False-Negative Ratio in Epidemiologic Studies. Epidemiology. 2011;22(4):450–456. doi: 10.1097/EDE.0b013e31821b506e 21490505

21. Simons DJ. The Value of Direct Replication. Perspectives on Psychological Science. 2014;9(1):76–80. doi: 10.1177/1745691613514755 26173243

22. Rand DG, Greene JD, Nowak MA. Spontaneous Giving and Calculated Greed. Nature. 2012;489(7416):427–430. doi: 10.1038/nature11467 22996558

23. Tinghög G, Andersson D, Bonn C, Böttiger H, Josephson C, Lundgren G, et al. Intuition and Cooperation Reconsidered. Nature. 2013;498(7452):E1–E2. doi: 10.1038/nature12194 23739429

24. Bouwmeester S, Verkoeijen PPJL, Aczel B, Barbosa F, Bègue L, Brañas-Garza P, et al. Registered Replication Report: Rand, Greene, and Nowak (2012). Perspectives on Psychological Science. 2017;12(3):527–542. doi: 10.1177/1745691617693624 28475467

25. Rand DG, Greene JD, Nowak MA. Rand et al. Reply. Nature. 2013;498(7452):E2–E3. doi: 10.1038/nature12195

26. Rand DG. Reflections on the Time-Pressure Cooperation Registered Replication Report. Perspectives on Psychological Science. 2017;12(3):543–547. doi: 10.1177/1745691617693625 28544864

27. Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. The Prevalence of Statistical Reporting Errors in Psychology (1985–2013). Behavior Research Methods. 2016;48(4):1205–1226. doi: 10.3758/s13428-015-0664-2 26497820

28. Klein RA, Ratliff KA, Vianello M, Adams RB, Bahník Š, Bernstein MJ, et al. Investigating Variation in Replicability: A “Many Labs” Replication Project. Social Psychology. 2014;45(3):142–152. doi: 10.1027/1864-9335/a000178

29. Ebersole CR, Atherton OE, Belanger AL, Skulborstad HM, Allen JM, Banks JB, et al. Many Labs 3: Evaluating Participant Pool Quality across the Academic Semester via Replication. Journal of Experimental Social Psychology. 2016;67:68–82. doi: 10.1016/j.jesp.2015.10.012

30. Yarkoni T, Westfall J. Choosing Prediction over Explanation in Psychology: Lessons from Machine Learning. Perspectives on Psychological Science. 2017;12(6):1100–1122. doi: 10.1177/1745691617693393

31. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics. Springer; 2009.

32. Nave G, Minxha J, Greenberg DM, Kosinski M, Stillwell D, Rentfrow J. Musical Preferences Predict Personality: Evidence From Active Listening and Facebook Likes. Psychological Science. 2018;29(7):1145–1158. doi: 10.1177/0956797618761659 29587129

33. Camerer CF, Nave G, Smith A. Dynamic Unstructured Bargaining with Private Information: Theory, Experiment, and Outcome Prediction via Machine Learning. Management Science. 2018;65(4):1867–1890. doi: 10.1287/mnsc.2017.2965

34. Wolfers J, Zitzewitz E. Interpreting Prediction Market Prices as Probabilities. National Bureau of Economic Research; 2006. 12200.

35. Simonsohn U. Small Telescopes: Detectability and the Evaluation of Replication Results. Psychological Science. 2015;26(5):559–569. doi: 10.1177/0956797614567341 25800521

36. Kasy M, Andrews I. Identification of and Correction for Publication Bias. American Economic Review. 2019;109(8):2766–2794. doi: 10.1257/aer.20180310

37. Bradley AP. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition. 1997;30(7):1145–1159. doi: 10.1016/S0031-3203(96)00142-2

38. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32. doi: 10.1023/A:1010933404324

39. Forsell E, Viganola D, Pfeiffer T, Almenberg J, Wilson B, Chen Y, et al. Predicting Replication Outcomes in the Many Labs 2 Study. Journal of Economic Psychology. 2018. doi: 10.1016/j.joep.2018.10.009

40. Inbar Y. Association between Contextual Dependence and Replicability in Psychology May Be Spurious. Proceedings of the National Academy of Sciences. 2016;113(34):E4933–E4934. doi: 10.1073/pnas.1608676113

41. Altmejd A. Registration of Predictions; 2017.

42. Gelman A, Carlin J. Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science. 2014;9(6):641–651. doi: 10.1177/1745691614551642 26186114

43. Meehl PE. Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. Minneapolis, MN, US: University of Minnesota Press; 1954.

44. Dawes RM. The Robust Beauty of Improper Linear Models in Decision Making. American Psychologist. 1979;34(7):571–582. doi: 10.1037/0003-066X.34.7.571

45. Bishop MA, Trout JD. Epistemology and the Psychology of Human Judgment. Oxford University Press; 2004.

46. Youyou W, Kosinski M, Stillwell D. Computer-Based Personality Judgments Are More Accurate than Those Made by Humans. Proceedings of the National Academy of Sciences. 2015;112(4):1036–1040. doi: 10.1073/pnas.1418680112

47. Kleinberg J, Lakkaraju H, Leskovec J, Ludwig J, Mullainathan S. Human Decisions and Machine Predictions. The Quarterly Journal of Economics. 2017;133(1):237–293. doi: 10.1093/qje/qjx032 29755141

48. Masnadi-Shirazi H, Vasconcelos N. Asymmetric Boosting. In: Proceedings of the 24th International Conference on Machine Learning. ICML’07. New York, NY, USA: ACM; 2007. p. 609–619.

49. Campbell DT. Assessing the Impact of Planned Social Change. Evaluation and Program Planning. 1979;2(1):67–90. doi: 10.1016/0149-7189(79)90048-X

50. Kleinberg J, Mullainathan S, Raghavan M. Inherent Trade-Offs in the Fair Determination of Risk Scores. arXiv:1609.05807. 2016.

51. Meng XL. Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election. The Annals of Applied Statistics. 2018;12(2):685–726. doi: 10.1214/18-AOAS1161SF

52. Simons DJ, Holcombe AO, Spellman BA. An Introduction to Registered Replication Reports at Perspectives on Psychological Science. Perspectives on Psychological Science. 2014;9(5):552–555. doi: 10.1177/1745691614543974 26186757

The article was published in the journal: 2019, Issue 12