# A novel statistical framework for quantifying risks and benefits of AI automation in screening mammography

**Authors:** Michael H. Bernstein, Maggie Chung, Adam Yala, Grayson L. Baird

PMC · DOI: 10.1371/journal.pdig.0001231 · PLOS Digital Health · 2026-02-26

## TL;DR

This paper introduces a statistical framework to help radiology practices choose AI “rule-out” thresholds in screening mammography by weighing workload reduction against the risk of additional missed cancers.

## Contribution

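The paper introduces a statistical framework that quantifies the trade-off between the benefit of AI automation (caseload reduction) and its risk (missed cancers) in screening mammography, using four metrics: CRR, G-FOR, N-FOR, and AN-FOR.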

## Key findings

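- At a 0.20 AI score threshold (Youden’s J), caseload falls by 75% at an adjusted net false omission rate (AN-FOR[30%]) of 0.14%, corresponding to 121 additional missed cancers.
- At a 0.05 threshold, caseload falls by 36% with no estimated additional missed cancers, under the assumption that radiologists detect an extra 30% of cancers among AI-retained cases.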
- The framework helps practices evaluate AI rule-out strategies based on local data and risk tolerance.
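
As a rough consistency check, the rates in these findings can be related to the missed-cancer counts reported in the abstract. The sketch below assumes that the denominator of a false omission rate is the number of ruled-out exams, approximated here as CRR × 114,229. Because the published CRR values are rounded, the results are only approximate. The function name `approx_for_percent` is hypothetical and not taken from the paper.

```python
TOTAL_EXAMS = 114_229  # screening mammograms in the study


def approx_for_percent(missed_cancers: float, crr: float,
                       total: int = TOTAL_EXAMS) -> float:
    """Approximate a false omission rate, in percent.

    Assumption: the denominator is the number of ruled-out exams,
    estimated as crr * total, where crr is the rounded published value.
    """
    ruled_out = crr * total
    return 100.0 * missed_cancers / ruled_out


# 0.20 threshold: 121 adjusted net missed cancers at CRR = 75%
an_for_020 = approx_for_percent(121, 0.75)
# 0.05 threshold: 0 adjusted net missed cancers at CRR = 36%
an_for_005 = approx_for_percent(0, 0.36)

print(f"{an_for_020:.2f}% {an_for_005:.2f}%")
```

Under these assumptions the two values round to the 0.14% and 0.00% quoted for AN-FOR[30%].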

## Abstract

AI has been proposed as a triage or “rule-out” device to reduce radiologist workload, but it is presently unclear how an AI “rule-out” threshold should be determined. We present a framework for determining an optimal threshold. Using a retrospective study design, 114,229 bilateral 2D digital screening mammograms from 2006-2023 at a single study site were analyzed. All mammograms were given an AI score using Mirai, an open-source deep-learning model that provides a 1-year risk score.

Four metrics were examined at two thresholds for separating ruled-out from retained cases:

1. Caseload Reduction Rate (CRR): the percent of caseload removed by rule-out.
2. Gross AI False Omission Rate (G-FOR): the probability that a ruled-out patient has breast cancer.
3. AI Net False Omission Rate (N-FOR): the probability that a ruled-out patient has a breast cancer that the radiologist would have caught in standard care (i.e., no triage).
4. AI Adjusted Net False Omission Rate (AN-FOR[30%]): N-FOR adjusted for the hypothetical scenario in which radiologists detect an extra 30% of breast cancers among AI-retained cases.

The two thresholds were risk scores of 0.20 (Youden’s J) and 0.05 (AN-FOR[30%] = 0). The former is mathematically optimal in the sense that it maximizes Youden’s J (sensitivity + specificity − 1); the latter is a threshold at which AI “rule-out” does not introduce any total increase in false negatives.

At the 0.20 threshold, G-FOR, N-FOR, and AN-FOR[30%] are 0.26%, 0.17%, and 0.14% (223, 141, and 121 missed cancer cases, respectively), and CRR = 75%. At the 0.05 threshold, G-FOR, N-FOR, and AN-FOR[30%] are 0.12%, 0.07%, and 0.00% (49, 30, and 0 missed cancer cases, respectively), and CRR = 36%.

We demonstrate how radiology practices can consider the trade-offs of using different AI scores as “rule-out” thresholds. Under the 30% adjustment, the Youden’s J threshold results in 121 additional missed cancers for a 75% caseload reduction, whereas at the 0.05 threshold we estimate no additional missed cancers for a 36% caseload reduction.
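
The four metrics defined in the abstract can be sketched in code. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: an exam is ruled out when its score is strictly below the threshold; each false omission rate uses the number of ruled-out exams as its denominator; and the 30% adjustment is read as radiologists catching 30% of the cancers they would otherwise have missed among retained cases, with the adjusted count floored at zero. The names `RuleOutMetrics`, `rule_out_metrics`, and `boost` are hypothetical, and the data are toy values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RuleOutMetrics:
    crr: float     # Caseload Reduction Rate
    g_for: float   # Gross AI False Omission Rate
    n_for: float   # AI Net False Omission Rate
    an_for: float  # AI Adjusted Net False Omission Rate


def rule_out_metrics(scores: Sequence[float],
                     has_cancer: Sequence[bool],
                     radiologist_detected: Sequence[bool],
                     threshold: float,
                     boost: float = 0.30) -> RuleOutMetrics:
    """Compute CRR, G-FOR, N-FOR and AN-FOR for one threshold.

    Assumption: exams with score < threshold are ruled out.
    """
    ruled_out = [s < threshold for s in scores]
    n_ruled_out = sum(ruled_out)
    if n_ruled_out == 0:
        return RuleOutMetrics(0.0, 0.0, 0.0, 0.0)

    rows = list(zip(ruled_out, has_cancer, radiologist_detected))
    # Every cancer among ruled-out exams
    gross_missed = sum(1 for r, c, _ in rows if r and c)
    # Ruled-out cancers the radiologist would have caught in standard care
    net_missed = sum(1 for r, c, d in rows if r and c and d)
    # Retained cancers the radiologist missed in standard care
    retained_missed = sum(1 for r, c, d in rows if not r and c and not d)

    # Assumption: a `boost` fraction of retained, previously missed cancers
    # is now detected, offsetting the net misses (floored at zero).
    adjusted_missed = max(0.0, net_missed - boost * retained_missed)

    return RuleOutMetrics(
        crr=n_ruled_out / len(scores),
        g_for=gross_missed / n_ruled_out,
        n_for=net_missed / n_ruled_out,
        an_for=adjusted_missed / n_ruled_out,
    )


# Toy data: 10 exams, not taken from the study
scores = [0.01, 0.02, 0.03, 0.04, 0.10, 0.30, 0.40, 0.50, 0.60, 0.70]
cancer = [False, True, False, True, False, True, True, False, True, True]
detected = [False, True, False, False, False, True, False, False, False, True]

m = rule_out_metrics(scores, cancer, detected, threshold=0.05)
print(m)
```

On the toy data, 4 of 10 exams are ruled out, 2 of those are cancers, and 1 of those would have been caught by the radiologist, so CRR = 0.4, G-FOR = 0.5, and N-FOR = 0.25. A practice applying this to local data would need to settle on its own definition of the adjustment term.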

Radiology practices are facing rapidly rising imaging volume and workforce shortages. One proposed approach is to use AI to “rule out” the lowest-risk screening mammograms so that radiologists can focus on a smaller set of exams that are more likely to contain cancer. However, it is unclear how rule-out thresholds should be selected (i.e., how to determine which proportion of cases should be ruled out versus retained for interpretation). We present a statistical framework to quantify the trade-offs between workload reduction and diagnostic risk when considering different potential thresholds for AI rule-out in triage. Using risk scores from 114,229 screening mammograms and an open-source AI model (Mirai), we translate different AI score thresholds into error rates that are clinically relevant and interpretable, and also calculate the workload reduction rate at each threshold. Importantly, our framework distinguishes additional cancers that AI “rule-out” would miss from cancers that would also have been missed under standard practice without AI “rule-out”. We also consider how AI “rule-out” may affect radiologist performance, for example the possibility that radiologists detect more cancers among retained cases when the reading pool is smaller and enriched. Our framework illustrates how different threshold choices produce different balances of caseload reduction and net missed cancers. Rather than recommending a specific threshold, this work provides a generalizable tool for practices and policymakers to evaluate AI “rule-out” strategies using local data and specific risk tolerances.
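
Evaluating several candidate thresholds on local data can be sketched as a simple sweep. The example below reports, for each threshold, the caseload reduction rate and Youden's J (sensitivity + specificity − 1), then picks the threshold with the largest J. It is an illustration under stated assumptions, not the paper's code: exams with a score strictly below the threshold count as ruled out (AI-negative), and the names `youden_j`, `sweep`, and `best_by_youden`, along with the toy data, are hypothetical.

```python
from typing import List, Sequence, Tuple


def youden_j(scores: Sequence[float], has_cancer: Sequence[bool],
             threshold: float) -> float:
    """Youden's J = sensitivity + specificity - 1 at one threshold.

    Assumption: score >= threshold is AI-positive (retained),
    score < threshold is AI-negative (ruled out).
    """
    tp = sum(1 for s, c in zip(scores, has_cancer) if c and s >= threshold)
    fn = sum(1 for s, c in zip(scores, has_cancer) if c and s < threshold)
    tn = sum(1 for s, c in zip(scores, has_cancer) if not c and s < threshold)
    fp = sum(1 for s, c in zip(scores, has_cancer) if not c and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one cancer and one non-cancer exam")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def sweep(scores: Sequence[float], has_cancer: Sequence[bool],
          thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, caseload reduction rate, Youden's J) per threshold."""
    rows = []
    for t in thresholds:
        crr = sum(1 for s in scores if s < t) / len(scores)
        rows.append((t, crr, youden_j(scores, has_cancer, t)))
    return rows


def best_by_youden(rows: Sequence[Tuple[float, float, float]]) -> float:
    """Threshold with the largest Youden's J among the candidates."""
    return max(rows, key=lambda row: row[2])[0]


# Toy data: 8 exams, not taken from the study
scores = [0.01, 0.03, 0.04, 0.10, 0.15, 0.25, 0.30, 0.60]
cancer = [False, False, False, False, True, False, True, True]

rows = sweep(scores, cancer, thresholds=[0.05, 0.20, 0.50])
for t, crr, j in rows:
    print(f"threshold={t:.2f}  CRR={crr:.3f}  J={j:.3f}")
best = best_by_youden(rows)
```

Youden's J weighs sensitivity and specificity equally, so the threshold it selects need not match a practice's risk tolerance. That is why the sweep reports caseload reduction alongside J rather than returning a single answer.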

## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Diseases:** Cancers (MESH:D009369), aneurysms (MESH:D000783), lung cancer (MESH:D008175), breast cancer (MESH:D001943), burnout (MESH:D002055), abdominal aortic aneurysm (MESH:D017544)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12944777/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12944777/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12944777/full.md

---
Source: https://tomesphere.com/paper/PMC12944777