The Difference Between "Replicable" and "Not replicable" is not Itself Scientifically Replicable
Berna Devezer, Erkan O. Buzbas

TL;DR
This paper argues that current methods for assessing scientific replicability are fundamentally limited: because replications are non-exact and the evidence consists of aggregated binary verdicts, 'replicable' and 'not replicable' results cannot be reliably told apart.
Contribution
It formalizes two statistical models of non-exact replications, revealing fundamental limitations in current replication rate estimates and their interpretation.
Findings
Under the benchmark model, even small variability in replicability rates creates an irreducible variance floor on the estimated mean replicability rate.
Standard data cannot reliably distinguish between high and low replicability sequences.
Aggregating heterogeneous literatures produces misleading averages of replicability.
Abstract
Replication studies estimate the replicability rate of scientific results by aggregating binary verdicts of experiments. Exact replications are rarely attainable, so most replication sequences are non-exact. Experiments differ in ways that matter and do not share a single data-generating process. We formalize two statistical interpretations of non-exactness. In a shared latent rate (benchmark) model, experiments are exchangeable and depend on a common random replicability rate. In a conditionally independent rates (operational) model, each experiment has its own replicability rate drawn from a population distribution. Under the benchmark model, even small variability among replicability rates induces an irreducible variance floor on the estimated mean replicability rate that no amount of replication can eliminate. Under the operational model, the degree of non-exactness is not…
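The variance floor described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the latent replicability rate is assumed to follow a Beta(a, b) distribution (the abstract does not name a distribution), and all function names are hypothetical. Under a shared rate, the law of total variance gives Var(mean) = E[theta(1 - theta)]/n + Var(theta), so the second term remains however large n grows. With independently drawn per-experiment rates, the binary verdicts are marginally Bernoulli with the mean rate, so the same estimator's variance shrinks like 1/n.

```python
import random
import statistics


def variance_floor(a, b):
    """Var(theta) for theta ~ Beta(a, b) (assumed distribution)."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def theoretical_variance(a, b, n):
    """Variance of the mean of n verdicts under a shared latent rate."""
    mu = a / (a + b)
    var_theta = variance_floor(a, b)
    # Law of total variance, using E[theta(1 - theta)] = mu(1 - mu) - Var(theta).
    return (mu * (1 - mu) - var_theta) / n + var_theta


def simulate_benchmark(a, b, n, rng):
    """One sequence of n verdicts that all share a single random rate."""
    theta = rng.betavariate(a, b)
    return sum(rng.random() < theta for _ in range(n)) / n


def simulate_operational(a, b, n, rng):
    """One sequence of n verdicts, each with its own independently drawn rate."""
    return sum(rng.random() < rng.betavariate(a, b) for _ in range(n)) / n


def empirical_variance(simulate, a, b, n, reps, seed=0):
    """Variance of the estimated rate across reps simulated sequences."""
    rng = random.Random(seed)
    return statistics.pvariance([simulate(a, b, n, rng) for _ in range(reps)])
```

Calling `empirical_variance(simulate_benchmark, 40, 10, n, 1000)` for increasing `n` should level off near `variance_floor(40, 10)`, while the same call with `simulate_operational` keeps shrinking toward zero. The parameter values here are illustrative choices, not figures from the paper.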
