# Don't Stop Me Now, 'Cause I'm Having a Good Time Screening: Evaluation of Stopping Methods for Safe Use of Priority Screening in Systematic Reviews

**Authors:** Tim Repke, Francesca Tinsdeall, Diana Danilenko, Sergio Graziosi, Finn Müller‐Hansen, Lena Schmidt, James Thomas, Gert van Valkenhoef

PMC · DOI: 10.1002/cesm.70068 · Cochrane Evidence Synthesis and Methods · 2026-01-21

## TL;DR

This paper evaluates methods for deciding when it is safe to stop screening in systematic reviews that use machine-learning-based priority screening, finding that most existing methods either miss relevant records or forgo potential work savings.

## Contribution

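The paper contributes an open-source evaluation framework that simulates priority screening on 81 real-world data sets, uses it to compare 15 statistical and rule-based stopping methods, and highlights the need for improved stopping criteria in priority screening.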

## Key findings

- Most existing stopping methods either miss relevant records or fail to save work.
- Only one method reliably meets the set recall target, and it stops conservatively.
- Good rankings are crucial for work-saving potential but are hard to maintain.

## Abstract

Priority screening has the potential to reduce the number of records that need to be annotated in systematic literature reviews. So‐called technology‐assisted reviews (TAR) use machine learning with prior include/exclude annotations to continuously rank unseen records by their predicted relevance, so that relevant records are found earlier. In this article, we present a systematic evaluation of methods to determine when it is safe to stop screening when using prioritization.

We implement an open‐source evaluation framework that features a novel method to generate rankings and simulate priority screening processes for 81 real‐world data sets. We use these simulations to evaluate 15 statistical or rule‐based (heuristic) stopping methods, testing a range of hyperparameters for each.
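To make the idea of simulating a stopping method concrete, the following is a minimal sketch and not the authors' framework. It replays a fixed, already ranked list of include/exclude labels against one simple heuristic rule: stop once a given number of consecutive excludes has been seen. The function name `simulate_consecutive_excludes` and the `patience` parameter are hypothetical, and the sketch assumes a static ranking, whereas real priority screening re-ranks as new annotations arrive.

```python
from typing import Sequence


def simulate_consecutive_excludes(labels: Sequence[int], patience: int) -> int:
    """Return how many records are screened before the rule stops.

    labels: 1 = include, 0 = exclude, in the order the ranker presents them.
    patience: hypothetical knob, the number of consecutive excludes
        that triggers stopping.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    run = 0  # length of the current streak of excludes
    for screened, label in enumerate(labels, start=1):
        run = 0 if label == 1 else run + 1
        if run >= patience:
            return screened
    # The rule never fired, so every record was screened.
    return len(labels)


# Toy ranking (illustrative data only, not from the paper).
ranked_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
stop_at = simulate_consecutive_excludes(ranked_labels, patience=3)
print(stop_at)  # 7: the rule stops before the include at position 8
```

In this toy run the rule stops one record too early and misses an include, which is the kind of failure a simulation across many data sets and hyperparameter values is meant to expose.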

The work‐saving potential and performance of stopping criteria heavily rely on “good” rankings, which are typically not achieved by a single ranking algorithm across the entire screening process. Our evaluation shows that almost all existing stopping methods either fail to reliably stop without missing relevant records or fail to utilize the full potential work‐savings. Only one method reliably meets the set recall target, but stops conservatively.
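The trade-off described here, missing relevant records versus leaving work savings unused, can be illustrated with two simple quantities at a given stopping point. The sketch below assumes recall is the share of all includes found before stopping and work saved is the share of records left unscreened; these definitions, the function name `recall_and_work_saved`, and the 0.95 target are illustrative assumptions, and the paper may define its metrics differently.

```python
from typing import Sequence, Tuple


def recall_and_work_saved(labels: Sequence[int], stop_at: int) -> Tuple[float, float]:
    """Compute recall and work saved when screening stops after `stop_at` records.

    labels: 1 = include, 0 = exclude, in screening order.
    stop_at: number of records screened before stopping.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    if not 0 <= stop_at <= len(labels):
        raise ValueError("stop_at must be between 0 and len(labels)")
    total_includes = sum(labels)
    found = sum(labels[:stop_at])
    # With no includes at all there is nothing to miss, so recall is 1.0.
    recall = found / total_includes if total_includes else 1.0
    work_saved = 1 - stop_at / len(labels)
    return recall, work_saved


# Toy ranking (illustrative data only, not from the paper).
ranked_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
recall, work_saved = recall_and_work_saved(ranked_labels, stop_at=7)
recall_target = 0.95  # assumed target, for illustration
print(f"recall={recall:.2f}, work_saved={work_saved:.2f}, "
      f"meets_target={recall >= recall_target}")
```

Stopping later raises recall but lowers work saved, so a method that always meets the target by stopping late is safe yet conservative.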

Many digital evidence synthesis tools provide priority screening features that are already used in many research projects. However, the theoretical work‐savings demonstrated in retrospective simulations of prioritization can only be unlocked with safe and reproducible stopping criteria. Our results highlight the need for improved stopping methods and guidelines on how to responsibly use priority screening. We also urge screening platforms to provide indicators and authors to transparently report metrics when automating (parts of) their synthesis.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12825451/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12825451/full.md

## References

64 references — full list in the complete paper: https://tomesphere.com/paper/PMC12825451/full.md

---
Source: https://tomesphere.com/paper/PMC12825451