A novel statistical framework for quantifying risks and benefits of AI automation in screening mammography

Michael H. Bernstein; Maggie Chung; Adam Yala; Grayson L. Baird

PMC · DOI:10.1371/journal.pdig.0001231·February 26, 2026

A novel statistical framework for quantifying risks and benefits of AI automation in screening mammography

Michael H. Bernstein, Maggie Chung, Adam Yala, Grayson L. Baird

PDF

Open Access

TL;DR

This paper introduces a statistical framework to help radiology practices choose AI 'rule-out' thresholds in mammography screening by balancing workload reduction and cancer detection risks.

Contribution

A novel statistical framework is introduced to quantify trade-offs between AI automation benefits and risks in mammography screening.

Findings

01

At a 0.20 AI score threshold, 75% caseload reduction is achieved with 0.14% adjusted net false omission rate.

02

A 0.05 threshold reduces caseload by 36% with no additional missed cancers.

03

The framework helps practices evaluate AI rule-out strategies based on local data and risk tolerance.

Abstract

AI has been proposed as a triage or “rule-out” device to reduce radiologist workload, but it is presently unclear how an AI “rule-out” threshold should be determined. We present a framework for determining an optimal threshold. Using a retrospective study design, 114,229 bilateral 2D digital screening mammograms were analyzed from 2006-2023 at a single study site. All mammograms were given an AI score using Mirai, an open-source deep-learning model which provides a 1-year risk score. Several metrics were examined using two thresholds for determining ruled out versus retained cases: 1) Caseload Reduction Rate (CRR; percent of caseload reduced due to rule-out), 2) Gross AI False Omission Rate (G-FOR; probability of a patient having breast cancer if ruled out), 3) AI Net False Omission Rate (N-FOR; probability of a patient having breast cancer if ruled out and the radiologist would have…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases8

breast cancer AI Cancers aneurysms lung cancer AN-FOR burnout abdominal aortic aneurysm

Figures2

Click any figure to enlarge with its caption.

Fig 1 — False Omission Rate by Caseload Reduction Rate.X-axis is caseload reduction rate (10% to 100%) and Y-axis is False Omission Rate (0.0% to 0.70%). Youden refers to Youden’s J (thin black line). G-FOR is Gross False Omission Rate (solid red). N-FOR is Net False Omission Rate (longest dash, bright blue). AN-FOR 10% (long dash, light blue), AN-FOR 30% (short dash, medium blue), AN-FOR 50% (short dash, dark blue), AN-FOR 70% (shortest dash, grey blue) refer to the Adjusted Net False Omission Rate at various percentages of additional breast cancers that radiologists would detect (10%, 30%, 50%, and 70% respectively) in AI-retained cases using an AI “rule-out” model relative to standard of care. Bootstrapped 95% confidence interval curve is shown in red-shaded area.

Fig 2 — False Discovery Rate by Caseload Reduction Rate.X-axis is caseload reduction rate (10% to 100%) and Y-axis is False Discovery Rate (60% to 100%). Thick black line is the relationship between Caseload Reduction Rate and False Discovery Rate Youden refers to Youden’s J (thin black line). AN-FOR 10% (long dash, light blue), AN-FOR 30% (short dash, medium blue), AN-FOR 50% (short dash, dark blue), AN-FOR 70% (shortest dash, grey blue) refer to the Adjusted Net False Omission Rate at various percentages of additional breast cancers that radiologists would detect (10%, 30%, 50%, and 70% respectively) in AI-retained cases using an AI “rule-out” model relative to standard of care. Bootstrapped 95% confidence interval curve is shown in red-shaded area.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAI in cancer detection · Artificial Intelligence in Healthcare and Education · Radiology practices and education

Full text

Introduction

The increasing use of artificial intelligence (AI) in radiology has prompted considerations about its potential in addressing the field’s mounting challenges. In recent years, the workload of radiologists has grown significantly. For instance, one study found that the workload for on-call radiologists in the Emergency Department quadrupled between 2006 and 2020 [1]. Another study that examined billed work relative value units (RVUs) among more than 35,000 academic radiologists found a 60% increase in workload from 2008 to 2020 [2]. This growing burden contributes to rising rates of burnout [3,4]. Furthermore, the number of individuals entering radiology residency has not kept pace with the rise of imaging volume [5], creating a growing workforce imbalance that is unlikely to be resolved in the near future.

AI has the potential to alleviate some of this burden. In particular, one promising application to improve efficiency is using AI as a triage or “ rule-out” device. By identifying screening mammograms that are extremely low-risk, AI can reduce the number of cases that require interpretation by a radiologist. The overwhelming majority of screening mammogram studies (>99%) are cancer-negative given the low prevalence of breast cancer in the screening population. As a result, radiologists devote the vast majority of their screening interpretation time to normal exams. Reductions in the number of low-risk cases requiring radiologist interpretation could translate into large cumulative time savings and workload reduction. This “rule-out” approach is well-suited for pathologies with a low prevalence rate, where there are many true negative cases that can be ruled out at the cost of very few false negatives. Thus, AI rule-out may be suited for screening mammograms where fewer than 1% of cases are positives [6,7]. The goal is to safely exclude the majority of normal cases and allow radiologists to concentrate on more suspicious exams.

Several groups have proposed using AI triage for “rule-out” in radiology; in breast imaging, studies have suggested that AI “rule-out” can perform comparably to, and in some cases better than, standard of care where radiologists interpret all mammograms [8–15]. It also aligns with a growing recognition that successful deployment of AI in medical imaging should not simply assist radiologists within the same interpretive task, but should instead take on distinct, complementary responsibilities that are aligned with each party’s strengths [16]. AI risk and detection models are increasingly evaluated in prospective, workflow-integrated settings that extend beyond interpretive assistance, including early implementation work exploring their use in prioritizing screening mammography workflows [17,18]. The potential role of AI rule-out of negative exams is an active area of discussion among regulatory bodies and professional societies. In this evolving landscape, a rigorous quantitative framework for evaluating potential AI “rule-out” thresholds, such as the one proposed in this study, may help inform future policy and implementation efforts.

One critical component of AI “rule-out” is determining the appropriate Triage Threshold. Most AI algorithms generate a continuous risk score for each image, with higher scores indicating a greater likelihood of pathology. However, where the precise cut-off should be placed for distinguishing which cases are ruled out (i.e., non-triage) versus reviewed by a radiologist (i.e., triage) remains an open question [19–22]. Setting the threshold depends on balancing a variety of benefits and risks, which we discuss below. For clarity, these are divided into “ruled out” and “retained” cases. Prior studies have typically selected thresholds based on conventional diagnostic performance metrics (e.g., sensitivity, specificity, ROC-AUC) or workload-reduction targets [8–12], but do not qualify the error burden introduced by a proposed rule-out policy and its downstream workflow impact compared to standard of care.

In this article, we present a framework for determining the optimal AI “rule-out” threshold for screening mammogram automation. We outline key metrics to evaluate the trade-offs between benefits and risks at different thresholds. Specifically, using AI risk scores from the Mirai model applied to 114,229 screening mammograms, we simulate rule-out thresholds to quantify their effects on caseload reduction and cancer detection. We propose approaches for identifying the optimal threshold based on these metrics. Importantly, the objective of this study is not to assess the performance of a specific AI system or threshold, nor to recommend a particular operating point. Rather, the goal of this study is to present a general framework for estimating false-positive and false-negative error rates in the context of AI “rule-out” as a function of caseload reduction relative to standard practice (i.e., no AI).

Methods

Ethics statement

University of California, San Francisco (UCSF) Institutional Review Board gave ethical approval for this Health Insurance Portability and Accountability Act–compliant study and waived the requirement for written informed consent.

Operational definitions

Overview.

Below, and in Table 1, we provide an operational definition for all terms used throughout the manuscript. A key consideration of this analysis is that examining metrics of AI performance, in isolation, is insufficient, because it fails to account for important counterfactuals and the AI-radiologist interaction (e.g., did AI fail to detect a pathology that the radiologist would have detected under standard care, or would the radiologist have also missed it?)

Table 1: Key Metrics and Definitions.

Benefits and risks in ruled-out cases.

The primary benefit of ruling out cases is caseload reduction, which can be quantified as the Caseload Reduction Rate (CRR) (Table 1). The higher the threshold, the higher the CRR; that is, the caseload reduction for radiologists will be higher when a more stringent (i.e., higher) threshold is set, ruling out a larger pool of cases. However, the benefit of a higher CRR must be weighed against a variety of other considerations.

First, one must consider how accurate an AI is at correctly ruling out cancer, given the cancer prevalence in the population; this is reflected by the AI Negative Predictive Value (AI-NPV). AI-NPV is the probability that a patient ruled out by AI truly does not have breast cancer. Some negative cases ruled out by AI might have otherwise been recalled by the radiologist in standard care (interpreting mammograms without AI triaging), potentially leading to unnecessary, costly, and stress-inducing diagnostic imaging and biopsies that turn out to be benign.

The aforementioned benefit must be carefully weighed against the AI Gross False Omission Rate (G-FOR, or 1-AI-NPV), which is the probability that a patient ruled out by AI actually has breast cancer. As the threshold is raised to exclude more cases, the G-FOR increases. That is, the CRR and the G-FOR come at a clear tradeoff; the more cases AI rules out, the higher the G-FOR will be. Nonetheless, it is important to note that not all cancers missed by AI in the ruled-out cases would have been detected by radiologists under standard practice (SP) (i.e., radiologist workflow absent an AI triaging model). That is, some cancer cases would likely have been missed regardless of whether AI triaging was used. To account for this, we define the AI Net False Omission Rate (N-FOR) as the G-FOR minus cancer cases that would have been missed by AI and radiologists (i.e., “deduct” cancer cases mutually missed by both radiologists and AI “rule-out” from the numerator).

Benefits and risks in retained cases.

The AI Positive Predictive Value (AI-PPV) reflects the probability that a patient has breast cancer given that the case was retained for radiologist review. The AI False Discovery Rate (AI-FDR, or 1-AI-PPV) refers to the probability that a patient retained by AI for radiologist review does not actually have breast cancer. The higher the “rule-out” threshold, the more cases AI will rule-out (i.e., the larger the CRR). This means that remaining (i.e., retained) cases are more likely to be true positives, which increases the AI-PPV and reduces the AI-FDR. However, decreasing the number of retained cases (and by definition also increasing the number of ruled-out cases) can have important implications for how they are interpreted.

Radiologist performance may improve as the retained reading pool size decreases due to reading fewer cases [23], reading an enriched batch with higher prevalence [24–26], and by consciously or unconsciously knowing that the cases were triaged [22] (i.e., anchoring or automation bias). These additional cancer detections could potentially offset a portion of the cancers missed among ruled-out cases due to the use of AI “rule-out” (N-FOR). Taking this into account, the Adjusted Net False Omission Rate (AN-FOR) refers to the probability of a patient having breast cancer that would have been detected by a radiologist in standard practice if ruled out, adjusted for the additional cancer detections due to AI “rule-out” that would have been missed in standard practice (i.e., “credit” cancer cases that radiologists would have otherwise missed without AI triaging them).

Another risk worth considering is that although radiologists are more likely to catch cancer cases they would have otherwise missed had AI not retained them, it is also likely that for the same reason, radiologists may also increase unnecessary recalls (i.e., radiologists recall non-cancer cases they would not have otherwise recalled had they not been retained by triage) [22].

Simulation methods

To illustrate the trade-offs associated with different AI “rule-out” thresholds, we conducted a “simulation” using risk scores from a deep learning model with screening mammography.

Study Sample. We conducted a single institution retrospective review of 114,229 bilateral 2D digital screening mammograms acquired between January 2006 and January 2023. Only screening examinations with at least 12 months of imaging follow-up within our health system were included; exams without complete follow-up were excluded from all analyses. All examinations in the cohort included standard 2D digital mammography images; a subset of exams also included tomosynthesis (DBT). Radiologists interpreted exams using the full available clinical dataset, including DBT when available. Exams with histopathologically confirmed breast cancer within 12 months of the screening mammogram were considered positive. Exams with at least 12 months of follow-up without a breast cancer diagnosis were considered negative. Based on these criteria, 864 cases (0.76%) were identified as positive. Of the positive cancers, screen-detected cancers were defined as examinations in which the patient was recalled at the index screening mammogram and subsequently diagnosed with breast cancer. Cancers were classified as “missed by radiology” (interval cancers) if the patient was not recalled at the index screening exam but was diagnosed with breast cancer within 12 months.

AI Model. Mammograms were assessed using Mirai, an open-source deep learning model available at https://github.com/yala/Mirai. All analyses were performed using the publicly released v0.5.0 codebase. No local modifications to the model architecture or weights were made [27,28]. Mirai was trained exclusively on non-UCSF data, with complete institutional separation from the evaluation cohort and no risk of data leakage. One-year risk scores (henceforth “scores” or “Mirai scores”) were used to simulate “rule-out” thresholds. The Mirai model accepts 2D digital mammography images only, and therefore only 2D images were used as inputs for all simulations.

Threshold Simulation Framework. We simulated various “rule-out” thresholds based on Mirai scores.

Modeling Assumptions. To model AN-FOR, we simulated four possible scenarios in which 10%, 30%, 50%, or 70% of missed cancers in standard practice were detected by using AI “rule-out”. These are intended as hypothetical scenarios only; the true rate is unknown. Likewise, our results utilize an AN-FOR rate of 30%, but this is for illustrative purposes only and should not be interpreted as suggesting 30% is the value that best corresponds with clinical practice.

Statistics. All modeling was conducted using SAS 9.4 (SAS Cary, NC), where sensitivities and specificities were estimated using the LOGISTIC procedure with the %ROCPLOT macro, and PPV, NPV, FDR, and FOR were calculated using Bayes’ Theorem (see Appendix A for PPV and NPV equations). The base rate of cancer was 0.76%. Confidence limits were generated for FDR and FOR using bootstrapping with the SURVEYSELECT procedure using 1000 replicates.

Results

Approaches to identifying rule-out threshold

Data were simulated using two “rule-out” thresholds that can be generalized across practices. The first uses diagnostic performance—Youden’s J—to define a threshold by optimizing the balance of sensitivity and specificity. The second defines a threshold using an outcome, in this case, avoiding any overall increase in missed breast cancers compared to standard practice without triage. That is, this threshold is set so that an AN-FOR of 0 is achieved, meaning all cancers missed by using AI triage (rule-out cases) are then offset by an identical number of cancer cases that a radiologist would catch because they were retained.

Identifying threshold using diagnostic performance (Youden’s J)

For these data, we observed that the Youden’s J value is a Mirai score of 0.20, achieving a sensitivity of 74% and a specificity of 75% (see Tables 2 and 4). Given a local prevalence of 0.76%, this translated into ruling out 85,220 cases and retaining 29,009 cases, resulting in a CRR of 75% (85,220/114,229). Of these ruled-out cases, 223 had breast cancer and 84,997 did not, thus achieving a G-FOR of 223/85,220 (0.26%). Of the retained cases, 641 had breast cancer and 28,368 did not, thus achieving an AI-FDR of 97.8%.

Table 2: Error Rates Using Youden’s J Threshold.

Table 3: Key for Table 4.

Table 4: Table of metrics and outcomes.

Of the 223 breast cancer cases that were ruled out, 82 were not recalled. That is, 82 were also missed by radiologists in standard of care while they recalled the remaining 141, thus achieving an N-FOR of 141/85,220 or 0.17%.

Regarding the retained cases, AI retained 66 cases that radiologists missed. Assuming radiologists detect 10%, 30%, 50% or 70% of these cases in AI “rule-out”, the adjusted net number of missed cancers in AI “rule-out” would be reduced to 7, 20, 33, or 46–134, 121, 108, or 95, respectively. This would correspond to Adjusted Net FOR values of 0.16%, 0.14%, 0.13%, and 0.11%, respectively. These values are visualized in Table 4 and Fig 1 (with invasive-cancer–only results shown in S1 Fig).

False Omission Rate by Caseload Reduction Rate.X-axis is caseload reduction rate (10% to 100%) and Y-axis is False Omission Rate (0.0% to 0.70%). Youden refers to Youden’s J (thin black line). G-FOR is Gross False Omission Rate (solid red). N-FOR is Net False Omission Rate (longest dash, bright blue). AN-FOR 10% (long dash, light blue), AN-FOR 30% (short dash, medium blue), AN-FOR 50% (short dash, dark blue), AN-FOR 70% (shortest dash, grey blue) refer to the Adjusted Net False Omission Rate at various percentages of additional breast cancers that radiologists would detect (10%, 30%, 50%, and 70% respectively) in AI-retained cases using an AI “rule-out” model relative to standard of care. Bootstrapped 95% confidence interval curve is shown in red-shaded area.

Identifying threshold using outcomes

Another approach to identifying the threshold is by considering the type of error and number of errors that would result from AI triage based on historical data. As shown in Fig 1 and Tables 3 and 4, depending on the percentage of additional breast cancer cases (i.e., 10%, 30%, 50%, and 70%) that radiologists would have detected among those retained by AI “rule-out” (compared to standard of care), the rule-out threshold can be set by determining the caseload reduction rate where AN-FOR intersects a certain value (here 0). As discussed above, this threshold corresponds to no additional missed cancers overall (among both retained and ruled out cases) relative to standard practice. As illustrated in Fig 1 and Table 4 (bold), assuming radiologists detect an additional 30% of missed cancers in cases retained by AI, a threshold of Mirai = 0.05 would achieve an AN-FOR of 0, which would translate into a CRR of about 36%. If radiologists detect an additional 70% of missed cancers, a threshold of Mirai = 0.09 would achieve an AN-FOR of 0, which would translate into a CRR of about 53%.

These CRR values can then be used to examine the corresponding number of false positives. As seen in Fig 2, the AI-FDR was between about 98% and 99% for all thresholds considered, indicating that FDR was largely stable. Given that false positives are unlikely to vary significantly, mainly because of low cancer prevalence [21], false negatives will be the primary focus here. Invasive-cancer–only results are shown in S2 Fig. To illustrate downstream clinical impact, we also quantified the number of benign and high-risk biopsies that would be avoided under each AI rule-out threshold. As shown in S1 Table, increasing caseload reduction is associated with a corresponding increase in potentially avoidable biopsies among ruled-out cases.

False Discovery Rate by Caseload Reduction Rate.X-axis is caseload reduction rate (10% to 100%) and Y-axis is False Discovery Rate (60% to 100%). Thick black line is the relationship between Caseload Reduction Rate and False Discovery Rate Youden refers to Youden’s J (thin black line). AN-FOR 10% (long dash, light blue), AN-FOR 30% (short dash, medium blue), AN-FOR 50% (short dash, dark blue), AN-FOR 70% (shortest dash, grey blue) refer to the Adjusted Net False Omission Rate at various percentages of additional breast cancers that radiologists would detect (10%, 30%, 50%, and 70% respectively) in AI-retained cases using an AI “rule-out” model relative to standard of care. Bootstrapped 95% confidence interval curve is shown in red-shaded area.

Comparing thresholds

To assess the trade-off between errors and benefits, we compare two thresholds: Mirai score of 0.20 corresponding to a 75% CRR (Youden’s J) and Mirai score of 0.05 corresponding to a 36% caseload reduction assuming AN-FOR of 0 where 30% of additional breast cancers would have been detected among retained cases by radiologists using an AI “rule-out” model compared to standard of care.

At the 0.20 threshold, the G-FOR, N-FOR, and AN-FOR (30%) are 0.26%, 0.17%, and 0.14%, respectively. This corresponds to 223, 141, and 121 missed cancer cases for the benefit of reading 85,220 fewer cases with an FDR of 97.8%. In contrast, at the 0.05 threshold, the G-FOR, N-FOR, and AN-FOR (30%) are 0.12%, 0.07%, and 0.00% (rounded), corresponding to 49, 30, and 0 missed cancer cases for the benefit of reading 41,127 fewer cases with an FDR of 98.9%. Tables 3 and 4 provide all combinations for comparison.

Discussion

We demonstrate how radiology practices can consider the trade-offs of using different AI scores to determine the “rule-out” threshold. Using the Mirai AI algorithm and historical data, our simulation demonstrates how a risk-benefit analysis could be quantified. Crucially, the purpose of this framework is not to advocate for a specific threshold or risk-benefit ratio or to evaluate a particular AI system. Rather, the goal of the present study is to demonstrate how a risk-benefit ratio could be quantified to inform policy and clinical implementation of AI “rule-out”. All numerical values provided are illustrative and are not intended as recommendations for clinical use.

The optimal threshold will vary depending on the AI model, the pathology (and the corresponding trade-offs of false positives and false negatives), the AI model’s sensitivity and specificity for a local population, the prevalence of the local population, the local caseload volume and radiologist staffing ability, and institutional risk tolerances. Our simulation highlights how error rates (risk) and caseload reduction rate (benefit) can be estimated using historical data. This estimation not only accounts for the type of errors (i.e., false positive and false negative) but also the number of errors (i.e., false discovery and omission rates instead of false positive and negative rates).

Our proposed framework could also be generalized to these screening domains. Many accepted screening programs operate in populations with disease prevalence below 1%. For example, low-dose CT lung cancer screening detects malignancy in 0.4-0.9% of examinations [29,30]. Abdominal ultrasound screening for abdominal aortic aneurysm finds clinically significant aneurysms requiring intervention in less than 0.5% of screened patients [31]. Thus, across modalities, the overwhelming majority (>99%) of screening examinations are disease-negative or findings that do not require intervention.

Research examining the performance of AI in radiology typically rely on sensitivity, specificity, and AUC-ROC metrics [8–12]. Error rates are critical for interpreting AI feedback. Fan et al. propose evaluating AI “rule-out” using PPV and NPV. The current study builds upon their approach in two key ways. Namely, Fan et al. do not account for key counterfactuals such as cancers that would have been missed by radiologists without triage and cancer only detected with triage because of changes in radiologist performance [13]. As such, the degree to which AI “rule-out” impacts diagnostic performance relative to no AI “rule-out” is not fully captured. In addition, Fan et al. propose using expected utility (EU) to assess AI “rule-out”. However, this relies on baseline relative utility values, which are difficult to define and when defined, may be difficult to justify, economically, ethically, and otherwise.

While our simulation focused on the number of any missed cancers, the type (e.g., in situ versus invasive) and stages/grade of cancers missed by AI could be incorporated to further assess the clinical significance of triage-related errors. What is more, we only evaluated cancers diagnosed within a year of the screening mammogram; other time frames (e.g., 1 and 2-year cancer outcomes) could be incorporated as well. Finally, for simplicity, we calculated the G-FOR, N-FOR, AN-FOR, and FDR using the direct rates, although confidence, prediction, or credible interval estimates could be used instead. Establishing specific thresholds or noninferiority margins is beyond the scope of this study as it would require multi-stakeholder policy decisions incorporating institutional risk tolerance, legal and regulatory context, radiologist availability, patient preferences, and operational priorities. However, our framework provides a quantitative approach for exploring and tailoring these decisions to individual practice settings. Additional limitations include: the use of only one study site, hypothetical percents used for AN-FOR values, the lack of a prospective study validation, and the lack of subanalyses by density or age. Moreover, the downstream effects of FDR (e.g., biopsy rate) may not be constant across different “rule-out” thresholds, a consideration that is not captured in this study. In addition, some patients may have undergone cancer diagnosis beyond the 12-month follow-up window and therefore were not captured in our outcome definition. Importantly, these limitations primarily affect the precision and generalizability of the illustrative numerical results and do not alter the central contribution of this work, which is the presentation of a general conceptual framework for evaluating AI “rule-out” strategies.

Future directions and implications

Along with a framework for determining a threshold for AI “rule-out” of screening mammograms, there are several important considerations that must be addressed before AI “rule-out” can be implemented in clinical practice. First, prospective validation of AI rule-out strategies is needed. In this study, we evaluated a range of hypothetical values (10–70%, with a focal scenario of 30%), which, while illustrative, highlight a critical unknown that must be resolved empirically. Past research has shown that automation bias plays a critical role in radiology [22], and other work has demonstrated that increased prevalence results in more abnormal images being flagged [25]. The AN-FOR rate will likely change as a function of threshold, and future work is needed to more carefully consider that level of nuance. Examining this topic empirically would require a multi-case multi-reader (MCMR) design where radiologists interpret imaging without AI, and then with AI at different “rule-out” thresholds. This validation will be important for understanding how AI “rule-out” impacts radiologist performance in the retained cases. Also, when local data are examined historically, practices should be mindful of how changes in technology, prevalence rate, and workflow may impact performance metrics and select the appropriate retrospective window. From a regulatory perspective (e.g., Food and Drug Administration or European Medicines Agency), one option might be to set the “rule-out” threshold at the point where AN-FOR rate is 0, after first running the algorithm on local data and estimating the relevant AN-FOR percentage, using the aforementioned MCMR study design. This would translate to caseload reduction without a net increase in the number of FNs. By quantifying error rates attributable to rule-out and accounting for downstream effects in retained cases, this framework may assist regulatory bodies in more comprehensively assessing benefit–risk profiles, informing study design, operating-point selection, and post-market monitoring strategies for AI triage “rule-out” systems.

Second, standards need to be developed for the safe deployment of AI “rule-out” tools in clinical settings and address approaches for ongoing monitoring of AI performance and safety over time. Third, there are psychological, ethical, legal, economic, and insurance considerations that must be weighed if implementing “rule-out”. Finally, there will need to be significant changes to the policy and regulatory landscape to allow AI “rule-out” in clinical practice. Addressing these considerations is necessary for the implementation of AI “rule-out”.

Supporting information

S1 FigFalse Omission Rate by Caseload Reduction Rate For Invasive Cancers.False omission rate (FOR) plotted against caseload reduction rate when restricting outcomes to invasive cancers only. The x-axis denotes caseload reduction rate (10–100%), and the y-axis denotes false omission rate (0–0.70%). The thin black vertical line indicates the threshold selected by Youden’s J statistic. The solid red curve represents the gross false omission rate (G-FOR), and the solid bright-blue curve represents the net false omission rate (N-FOR). Dashed curves show the adjusted net false omission rate (AN-FOR), assuming that radiologists detect an additional 10%, 30%, 50%, or 70% of cancers in AI-retained cases relative to standard of care (light blue long dash, medium blue short dash, dark blue short dash, and gray-blue shortest dash, respectively).(DOCX)

S2 FigFalse Discovery Rate as a Function of Caseload Reduction for Invasive Cancers.False discovery rate (FDR) plotted against caseload reduction rate when restricting outcomes to invasive cancers only. The x-axis denotes caseload reduction rate (10–100%), and the y-axis denotes false discovery rate (60–100%). The thick black curve represents the empirical relationship between caseload reduction and FDR. The thin black vertical line indicates the threshold selected by Youden’s J statistic. Dashed curves represent adjusted net false omission rate (AN-FOR) scenarios assuming that radiologists detect an additional 10%, 30%, 50%, or 70% of cancers in AI-retained cases relative to standard of care (light blue long dash, medium blue short dash, dark blue short dash, and gray-blue shortest dash, respectively).(DOCX)

S1 TableBiopsies Avoided at Each AI Rule-Out Threshold.Number of biopsies resulting in high-risk or benign pathology that would be retained or ruled out at each AI score threshold. For each threshold, counts are reported separately for high-risk and benign pathology. Values in the “ruled-out” columns represent biopsies that could be avoided under the corresponding AI rule-out strategy, whereas values in the “retained” columns represent biopsies that would remain in the clinical workflow and be performed downstream of a screening recall.(DOCX)

Bibliography31

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Bruls RJM, Kwee RM. Workload for radiologists during on-call hours: dramatic increase in the past 15 years. Insights Imaging. 2020;11: 121. doi: 10.1186/s 13244-020-00925-z 33226490 PMC 7683675 · doi ↗ · pubmed ↗
2Burns J, Chung Y, Rula EY, Duszak R Jr, Rosenkrantz AB. Evolving Trainee Participation in Radiologists’ Workload Using A National Medicare-Focused Analysis From 2008 to 2020. J Am Coll Radiol. 2025;22(1):98–107. doi: 10.1016/j.jacr.2024.08.029 39453332 · doi ↗ · pubmed ↗
3Harry E, Sinsky C, Dyrbye LN, Makowski MS, Trockel M, Tutty M, et al. Physician Task Load and the Risk of Burnout Among US Physicians in a National Survey. Jt Comm J Qual Patient Saf. 2021;47(2):76–85. doi: 10.1016/j.jcjq.2020.09.011 33168367 · doi ↗ · pubmed ↗
4Chetlen AL, Chan TL, Ballard DH, Frigini LA, Hildebrand A, Kim S, et al. Addressing Burnout in Radiologists. Acad Radiol. 2019;26(4):526–33. doi: 10.1016/j.acra.2018.07.001 30711406 PMC 6530597 · doi ↗ · pubmed ↗
5Smith-Bindman R, Kwan ML, Marlow EC, Theis MK, Bolch W, Cheng SY, et al. Trends in Use of Medical Imaging in US Health Care Systems and in Ontario, Canada, 2000-2016. JAMA. 2019;322(9):843–56. doi: 10.1001/jama.2019.11456 31479136 PMC 6724186 · doi ↗ · pubmed ↗
6Ellington TD, Miller JW, Henley SJ, Wilson RJ, Wu M, Richardson LC. Trends in Breast Cancer Incidence, by Race, Ethnicity, and Age Among Women Aged ≥20 Years - United States, 1999-2018. MMWR Morb Mortal Wkly Rep. 2022;71(2):43–7. doi: 10.15585/mmwr.mm 7102 a 2 35025856 PMC 8757618 · doi ↗ · pubmed ↗
7Grabler P, Sighoko D, Wang L, Allgood K, Ansell D. Recall and Cancer Detection Rates for Screening Mammography: Finding the Sweet Spot. AJR Am J Roentgenol. 2017;208(1):208–13. doi: 10.2214/AJR.15.15987 27680714 · doi ↗ · pubmed ↗
8Larsen M, Aglen CF, Hoff SR, Lund-Hanssen H, Hofvind S. Possible strategies for use of artificial intelligence in screen-reading of mammograms, based on retrospective data from 122,969 screening examinations. Eur Radiol. 2022;32(12):8238–46. doi: 10.1007/s 00330-022-08909-x 35704111 PMC 9705475 · doi ↗ · pubmed ↗