Retrospective evaluation of interval breast cancer screening mammograms by radiologists and AI

Jonas Subelack; Rudolf Morant; Marcel Blum; Axel Gräwingholt; Justus Vogel; Alexander Geissler; David Ehlig

PMC · DOI:10.1007/s00330-025-11833-5·August 4, 2025

Retrospective evaluation of interval breast cancer screening mammograms by radiologists and AI

Jonas Subelack, Rudolf Morant, Marcel Blum, Axel Gräwingholt, Justus Vogel, Alexander Geissler, David Ehlig

PDF

Open Access

TL;DR

This study shows that an AI system can detect breast cancer risk in mammograms that radiologists may have missed, potentially improving screening accuracy.

Contribution

The study provides evidence that AI can identify high-risk interval breast cancer cases that may be overlooked by radiologists.

Findings

01

The AI system assigned higher risk scores to potentially missed breast cancer cases compared to those without abnormalities.

02

13.4% of interval breast cancers without visible abnormalities received a high AI risk score, suggesting AI can detect hidden risks.

03

45.2% of high-risk AI cases were not discussed in consensus conferences, indicating potential for AI to improve screening outcomes.

Abstract

To determine whether an AI system can identify breast cancer risk in interval breast cancer (IBC) screening mammograms. IBC screening mammograms from a Swiss screening program were retrospectively analyzed by radiologists/an AI system. Radiologists determined whether the IBC mammogram showed human visible signs of breast cancer (potentially missed IBCs) or not (IBCs without retrospective abnormalities). The AI system provided a case score and a prognostic risk category per mammogram. 119 IBC cases (mean age 57.3 (5.4)) were available with complete retrospective evaluations by radiologists/the AI system. 82 (68.9%) were classified as IBCs without retrospective abnormalities and 37 (31.1%) as potentially missed IBCs. 46.2% of all IBCs received a case score ≥ 25, 25.2% ≥ 50, and 13.4% ≥ 75. Of the 25.2% of the IBCs ≥ 50 (vs. 13.4% of a no breast cancer population), 45.2% had not been…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases2

breast cancer IBC

Figures5

Click any figure to enlarge with its caption.

Derivation of the final data set. Some of the older mammography screening devices (until 2014) are not compatible with the AI system

Share of IBC cases above a certain case score threshold

Funding1

—Cancer League of Eastern Switzerland

Keywords

BreastBreast neoplasmsMammographyArtificial intelligenceInterval breast cancer

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGlobal Cancer Incidence and Screening · AI in cancer detection · Radiomics and Machine Learning in Medical Imaging

Full text

Introduction

Research demonstrated that participation in a mammography screening program (MSP) leads to a reduction in breast cancer (BC) related mortality, underscoring the crucial role of early detection [1–3]. Switzerland has neither a nationwide MSP nor MSPs in all federal states (cantons), but recent research also validated these beneficial findings in the Swiss context when comparing cantons with and without an MSP [4]. While the overall goal of MSPs is to detect BC as early as possible and thus to reduce mortality, BC may go undetected. For instance, this could be due to the limited ability to detect abnormalities in dense breast tissue, specific tumor development pathways that only manifest slowly and with minor signs, and radiologist errors in either visual perception or interpretation [5, 6]. Interval breast cancers (IBC) are cases where BC was diagnosed within the screening interval [7]. IBCs are concerning, as they are associated with more aggressive tumor characteristics, higher mortality rates, and require more frequently invasive treatments than screen-detected BC [7–9]. While the occurrence of IBCs is not completely avoidable due to the natural progression of cancers between screenings, minimizing their incidence by higher screening sensitivity is vital for improving the effectiveness of MSPs. Thus, the rate of IBC cases is a key quality indicator of MSPs [7, 10].

In general, the number of IBCs could be reduced by shortening the intervals (for women at higher risk) or by using other imaging modalities (e.g., MRI), which comes with workload increase, higher costs for an MSP and higher burden for affected women [11]. The advent of artificial intelligence (AI) presents a promising opportunity to enhance early detection of BC within MSPs without necessarily increasing workload. Multiple studies outlined the potential benefit of AI systems to improve the accuracy and/or efficiency of MSPs in BC detection [12–17]. However, few studies to date have tested the ability of AI to reduce IBCs [6, 14, 18]. Specifically, “the UK National Screening Committee review cited the lack of consistent evidence on the performance of AI algorithms in the detection of interval cancers as a major reason for not recommending the integration of AI into routine mammography screening workflows” [19, 20]. Further, neither the specific AI system has been tested for its potential to reduce IBCs, nor has such a study been conducted in Switzerland. Thus, our study aims to answer the research question: Can an AI system identify BC signs in IBC screening mammograms? Additionally, we presented the IBC screening mammograms independently to three radiologists to retrospectively evaluate whether there were any visible BC signs.

Materials and methods

Ethics approval

This retrospective cohort study was submitted to the Ethics Commission of Eastern Switzerland (EKOS–23/061). It was ruled to be a quality control study covered by the informed patient consent of the MSP. All data were pseudonymized by the Cancer Registry of Eastern Switzerland prior to analyses.

The study followed the STARD guideline for reporting diagnostic accuracy studies [21].

Study population

The study included all IBC cases of women who participated in the MSP “donna” (double reading) in the Swiss cantons of St.Gallen and Grisons (10% of Swiss inhabitants) between 2010 and 2019. All women between 50 and 69 years of age in the cantons of St.Gallen (since 2010) and Grisons (since 2011) were invited every 2 years to voluntarily participate in the MSP, which is covered by the compulsory health insurance except for a co-payment of less than 20 USD [22]. A recorded tumor was classified as IBC when an invasive or in situ BC (ICD-10-codes: C50/D05 in the cancer registry) was diagnosed within 24 months after a normal (i.e., no breast cancer diagnosis, including cases that have been discussed in a consensus conference/further investigated) screening mammogram (similar as in Niraula et al [7]/the European guidelines [10]). Cancer data were retrieved from the databases of the Cancer Registries of Eastern Switzerland and Grisons-Glarus, documenting all cancer cases within their cantons.

AI system

We utilized the AI system ProFound AI® from iCAD (version 3.1 for 2D FFDM), specifically the case score and risk score/category. The case score reflects the degree of confidence that a mammogram currently includes a BC compared to the training database on a scale from 0 (BC very unlikely) to 100 (BC very likely) [23]. The numerical risk score reflects the risk (probability) of being diagnosed with BC within the next 2 years. This forecasted risk is based on age, regional incidence data and numerous mammographic characteristics and is also expressed on a scale of 0 to 100 [24]. Additionally, the numerical risk score is used to classify a mammogram into one of four risk categories (as defined by vendor): low (< 0.15%), general (0.15–0.59%), moderate (0.6–1.59%), or high (≥ 1.6%) in reference to the average risk of a woman’s age.

Sample generation and data processing

From 151,233 women who participated in screening in the MSP “donna” from 2010 to 2019, 268 IBCs have been identified. Of the 268 IBCs, 242 IBC screening mammograms were independently provided to the radiologists and the AI system (after excluding LCIS cases, secondary IBC cases, and a tumor without a defined UICC-TNM stage). 119 IBC screening mammograms have been interpretable by the AI system (i.e., the other mammograms could not be analyzed because the stored mammograms have been performed by mammography screening devices that are not compatible with the AI system). Figure 1 shows the derivation of the final sample. This includes variables from the MSP, the cancer registry, and the outcomes from the AI system. MSP data included, for example, the year of screening, breast density (Breast Imaging Reporting and Data System [25] [BI-RADS] a to d), both radiologists’ reviews (R1 to R5), consensus conference conclusions (BC ruled out or further investigation), further conducted examinations (e.g., MRI), overall MSP outcome (BC detection/cancer ruled out) and self-declared information as family BC history (mother, sister or daughter diagnosed with BC). Further, cancer registry data included diagnosis and staging (e.g., ICD-10, TNM classifications), individual information (e.g., age at diagnosis), cancer biology information (e.g., Ki-67 proliferation index), treatment information (e.g., mastectomy), medical follow-up (e.g., recurrence), and vital status. Women with LCIS (i.e., D05.0) were excluded from the analysis as LCIS is a benign condition and not treated as a carcinoma [26].Fig. 1. Derivation of the final data set. Some of the older mammography screening devices (until 2014) are not compatible with the AI system

Retrospective evaluation of the IBC mammograms by radiologists (no AI involvement)

Three radiologists (two Swiss, one German, all with 10+ years’ experience, actively reviewing breast cancer screening mammograms in MSPs, and no exposure to AI in regular screening) were each provided with all IBC screening mammograms and 90 normal screening mammograms without BC that have been randomly selected by the cancer registry data analyst and mixed between the IBC mammograms to reduce the bias of radiologists, not to assume that they can always detect something. The blind review methodology, first looking at the screening mammogram without knowing the localization of the BC, is aligned with the European guidelines [10]. The radiologists independently assessed the mammograms, and whenever one or more radiologists assigned a mammogram with R3 or higher, the case was discussed during a consensus conference. Here, the screening mammograms were jointly reviewed and discussed, and a final decision was made on whether further examination was warranted due to visible abnormalities indicating malignancy. To classify the IBCs into the standardized categories (true interval, occult, minimal signs, false negative), the European guidelines require that the retrospective review evaluate first the screening mammogram and then the diagnostic mammogram. As we did not have access to the diagnostic mammograms, these categories could not be utilized. A reduced classification was sufficient to address our research question, distinguishing between “IBCs without retrospective abnormalities” and “potentially missed IBCs” if the consensus conference recommended further investigations.

Statistical analyses

A descriptive table outlines patient, MSP, AI system output and cancer characteristics, stratified according to the “IBCs without retrospective abnormalities” and “potentially missed IBC” subgroups. ANOVA and Student’s t-test, as appropriate, highlight the significance of differences in patient-level variables, including tumor characteristics, between the subgroups. Diagrams visualize the distribution of case scores and risk classifications of the two subgroups, additionally stratified by breast density. A scatterplot chart illustrates the two key AI outputs in a graph, with the additional information on whether these IBC cases have been discussed in a consensus conference. Line graphs and respective sensitivity tables stress the share of IBC cases that would have been classified as abnormal based on the respective AI case score threshold/within the respective risk category. Results were considered significant at the 95% confidence level. All statistical analyses were performed in Stata 18.5 [27].

Results

Table 1 shows that the women with IBC were on average 57.3 (standard deviation (SD): 5.4) years old, born in Switzerland (73.1%; n = 87), from the canton St.Gallen (80.7%; n = 96), and had a high breast density (BI-RADS c/d: 63:0%; n = 75). The retrospective review by radiologists of the screening mammograms resulted in 68.9% (n = 82) IBCs without retrospective abnormalities and 31.1% (n = 37) potentially missed IBCs. There were no significant differences in evaluated patient characteristics between the two subgroups. Potentially missed IBC received significantly higher radiologist reviews compared to the IBCs without retrospective abnormalities during the previous MSP (BI-RADS 3-5: 62.2% vs. 8.5%) and within the retrospective evaluations (BI-RADS 3-5: 100% vs. 51.2%; p < 0.05). During the previous screening, 40.5% (n = 15) of potentially missed IBCs (vs. 1.2% of IBCs without retrospective abnormalities) were investigated further (including some biopsies). The retrospective evaluation suggested further investigations for all potentially missed IBC cases.Table 1. Patient and tumor characteristicsAll IBCsIBCs without retrospective abnormalitiesPotentially missed IBCsTest for differencesNMean (SD)/proportionNMean (SD)/proportionNMean (SD)/proportionp-valuePatient characteristicsAge11957.3 (5.4)8256.8 (5.0)3758.4 (6.2)p = 0.15^g^Birthplace (%)p = 0.98^h^ Outside Switzerland3226.89%2226.83%1027.03% In Switzerland8773.11%6073.17%2772.97%Canton (%)p = 0.67^h^ Grisons2319.33%1518.29%821.62% St.Gallen9680.67%6781.71%2978.38%Breast density^a^p = 0.20^h^ BI-RADS a54.20%22.44%38.11% BI-RADS b3932.77%2732.93%1232.43% BI-RADS c7159.66%4959.76%2259.46% BI-RADS d43.36%44.88%00.00%Family history of BC^b^p = 0.19^h^ Positive3226.89%2530.49%718.92% Negative8773.11%5769.51%3081.08%Initial MSP resultsHighest radiologist review^a^11932.8 (2.8)8223.11 (2.6)3754.13 (5.2)p < 0.05^h^ R15747.90%5162.20%616.22% R23226.89%2429.27%821.62% R32117.65%56.10%1643.24% R486.72%22.44%616.22% R510.84%00.00%12.70%Consensus conferencesp < 0.05^h^ Yes3126.05%89.76%2362.16% No8873.95%7490.24%1437.84%Further investigations^c^p < 0.05^h^ Yes1613.45%11.22%1540.54% No10386.55%8198.78%2259.46%Retrospective radiologist evaluationsHighest radiologist review^d^p < 0.05^h^ R100.00%00.00%00.00% R24033.61%4048.78%00.00% R31815.13%1720.73%12.70% R45445.38%2429.27%3081.08% R575.88%11.22%616.22%Consensus conferencesp < 0.05^h^ Yes6453.78%2732.93%37100% No5546.22%5567.07%00.00%Further investigations^e^p < 0.05^h^ Yes3831.93%11.22%37100% No8168.07%8198.78%00.00%AI outputCase scores11932.8 (30.0)8223.1 (23.8)3754.1 (31.6)p < 0.05^g^ ≥ 0 and < 103932.77%3745.12%25.41% ≥ 10 and < 201714.29%1214.63%513.51% ≥ 20 and < 301210.08%89.76%410.81% ≥ 30 and < 401210.08%910.98%38.11% ≥ 40 and < 5097.56%56.10%410.81% ≥ 50 and < 6032.52%22.44%12.70% ≥ 60 and < 7086.72%33.66%513.51% ≥ 70 and < 8054.20%22.44%38.11% ≥ 80 and < 9054.20%33.66%25.41% ≥ 90 and ≤ 10097.56%11.22%821.62%Risk classificationp < 0.05^h^ Low00.00%00.00%00.00% General4638.66%3947.56%718.92% Moderate4134.45%3137.80%1027.03% High3025.21%1214.63%1848.65% n.a.21.68%00.00%25.41%Cancer dataStage distribution (%)p = 0.25^h^ In situ86.72%78.54%12.70% I4134.45%2834.15%1335.14% II4336.13%2935.37%1437.84% III1512.61%1315.85%25.41% IV119.24%44.88%718.92% n.a.10.84%11.22%00.00%Occurrencep = 0.86^h^ Within 6 months1411.76%1012.20%410.81% Between 7 and 12 months3025.21%1923.17%1129.73% Between 13 and 18 months3630.25%2834.15%821.62% Between 19 and 24 months3932.77%2530.49%1437.84%Mean tumor size (mm)11920.9 (16.4)8222.0 (17.3)3718.5 (14.1)p = 0.28^g^ Tumor size < 2 cm6252.10%4048.78%2259.46% Tumor size < 5 cm11092.44%7591.46%3594.59%Lymph node involvementp = 0.06^h^ Positive (N1+)2722.69%2125.61%616.22% Negative (N0)6857.14%5060.98%1848.65% n.a.2420.17%1113.41%1335.14%Grading^f^p = 0.25^h^ Grade I1613.45%910.98%718.92% Grade II5546.22%3947.56%1643.24% Grade III3932.77%2631.71%1335.14% n.a.97.56%89.76%12.70%Hormone receptorp = 0.74^h^ Positive9680.67%6579.27%3183.78% Negative2016.81%1417.07%616.22% n.a.32.52%33.66%00.00%Estrogen receptorsp = 0.38^h^ Positive (≥ 1)9680.67%6579.27%3183.78% Negative2016.81%1417.07%616.22% n.a.32.52%33.66%00.00%Progesterone receptorsp = 0.39^h^ Positive (≥ 1)8067.23%5465.85%2670.27% Negative3529.41%2429.27%1129.73% n.a.43.36%44.88%00.00%HER2 overexpressionp = 0.91^h^ Positive2420.17%1821.95%616.22% Negative8974.79%5971.95%3081.08% n.a.65.04%56.10%12.70%Triple negative1210.08%910.98%38.11%p = 0.59^h^Ki-67 proliferation indexp = 0.83^h^ Low (< 10%)1613.45%910.98%718.92% Medium (≥ 10%, < 25%)4436.97%3137.80%1335.14% High (≥ 25%)5042.02%3542.68%1540.54% n.a.97.56%78.54%25.41%^a^ According to BI-RADS, the highest review from two initial readers, and some random cases are also evaluated by a third reader for training purposes^b^ Mother, sister or daughter diagnosed with BC^c^ Diverse further investigation options as another mammogram, ultrasound, MRI to biopsy^d^ According to BI-RADS, the highest review from all three radiologists participating in the retrospective evaluation of IBC mammograms^e^ Further investigations suggested by a retrospective consensus conference due to abnormal findings in screening mammogram^f^ According to the Union for International Cancer Control^g^ Student’s t-test^h^ Anova

The AI system rated the IBC screening mammograms with much higher case scores compared to a representative non-cancer reference group and the normal mammograms of this study (see Fig. 2a). The potentially missed IBCs received significantly higher case scores compared to the IBCs without retrospective abnormalities (mean case score: 54.1 (5.2) vs. 23.1 (2.6); p < 0.05; Table 1). Examining IBC mammograms by breast density of women, it is evident that case scores are higher for low breast density, especially for potentially missed IBCs with low breast density (see Fig. 2b, c). Further, the AI system assigned the IBC screening mammograms to higher risk categories compared to a general reference group (see Fig. 2d). The potentially missed IBCs have been assigned to significantly higher risk categories compared to the IBCs without retrospective abnormalities (high-risk category: 48.7% vs. 14.6%; p < 0.05; Table 1).Fig. 2a Relative distribution of case scores. Source for non-cancer reference group: iCAD Inc. 2023 [24]. b Relative distribution of case scores: Low breast density subgroup. c Relative distribution of case scores: High breast density subgroup. d Relative distribution of risk categories. Source for general reference group: Eriksson et al [12]; General reference group including normal, screen-detected, and interval breast cancer screening mammograms from a representative Swedish screening population

The combined distribution of case and risk score (see Fig. 3a) highlights that many IBC cases are in the low to medium case and risk score area and shows a noticeable association between case and risk score, with some exceptions. 46.2% of all IBCs received a case score above 25, 25.2% above 50, and 13.4% above 75 (Table 2). Figure 3b outlines the distribution of case and risk score for the two subgroups, where it is evident that many IBCs without retrospective abnormalities have very low case and risk scores, whereas the potentially missed IBCs are spread widely from low to high case and risk scores. 73.0% of potentially missed IBCs (vs. 34.1% of IBCs without retrospective abnormalities) received a case score above 25, 51.4% (vs. 13.4%) above 50, and 29.7% (vs. 6.1%) above 75. In addition, Fig. 3c highlights visually by subgroup whether the individual IBC cases have been discussed in a consensus conference within the previous MSP. Here, it is visible that there is a considerable number of IBCs without retrospective abnormalities, as well as potentially missed IBCs that have a high case-/risk score that have not been discussed in a consensus conference before. Specifically, 54.8% of all IBC (60.9% potentially missed IBCs vs. 37.5% IBCs without retrospective abnormalities) received a case score above 50 and have been discussed during a consensus conference during the MSP; similarly, 22.6% of all IBC (39.1% vs. 0%) received a case score above 75 and have been discussed (Table 2).Fig. 3a Distribution of case and risk score for all IBC cases. b Distribution of case and risk score stratified by subgroup. c Distribution of case and risk score stratified by subgroup and consensus conferenceTable 2Accumulated IBC cases per case score thresholdGeneral distributiono/w consensus conferenceo/w further investigatedCase scoresAll IBCs (n = 119)No retr. abn. (n = 82)Pot. missed (n = 37)All IBCsNo retr. abn.Pot. missedAll IBCsNo retr. abn.Pot. missed**≥ 955.9%1.2%16.2%12.9%0.0%17.4%18.8%0.0%20.0%≥ 907.6%1.2%21.6%12.9%0.0%17.4%18.8%0.0%20.0%≥ 859.2%3.7%21.6%12.9%0.0%17.4%18.8%0.0%20.0%≥ 8011.8%4.9%27.0%19.4%0.0%26.1%25.0%0.0%26.7%≥ 7513.4%6.1%29.7%22.6%0.0%30.4%31.3%0.0%33.3%≥ 7016.0%7.3%35.1%29.0%0.0%39.1%37.5%0.0%40.0%≥ 6521.8%9.8%48.6%48.4%25.0%56.5%50.0%0.0%53.3%≥ 6022.7%11.0%48.6%48.4%25.0%56.5%50.0%0.0%53.3%≥ 5522.7%11.0%48.6%48.4%25.0%56.5%50.0%0.0%53.3%≥ 5025.2%13.4%51.4%54.8%37.5%60.9%62.5%100%60.0%≥ 4531.1%17.1%62.2%61.3%50.0%65.2%68.8%100%66.7%≥ 4032.8%19.5%62.2%61.3%50.0%65.2%68.8%100%66.7%≥ 3538.7%26.8%64.9%64.5%50.0%69.6%68.8%100%66.7%≥ 3042.9%30.5%70.3%67.7%50.0%73.9%75.0%100%73.3%≥ 2546.2%34.1%73.0%74.2%62.5%78.3%75.0%100%73.3%≥ 2052.9%40.2%81.1%83.9%75.0%87.0%87.5%100%86.7%≥ 1560.5%50.0%83.8%83.9%75.0%87.0%87.5%100%86.7%≥ 1067.2%54.9%94.6%87.1%75.0%91.3%93.8%100%93.3%≥ 5**84.9%78.0%100%96.8%87.5%100%100%100%100%“No retr. abn.”: IBCs without retrospective abnormalities; “Pot. missed”: potentially missed IBCs; “o/w consensus conference”: IBCs that have been discussed within the previous MSP consensus conference; “o/w further investigated”: IBCs that have been further investigated within the previous MSP

Discussion

Our study aims to assess whether an AI system can identify BC signs in IBC screening mammograms and thus reduce the number of IBCs in an MSP. Therefore, we retrospectively analyzed the IBC cases from the Swiss MSP “donna” from 2010 to 2019 using a retrospective radiologist review and an AI system. When reviewing the IBC characteristics of the study population, we found that they are in line with previous studies (e.g., high breast density) [5, 6, 28]. It is noteworthy that the radiologists observed abnormalities in 26.1% of all IBCs already during the previous screening, discussed these cases during a consensus conference, and ultimately investigated 13.5% of all IBCs further without detecting BC. In comparison, in the screening routine of the MSP “donna,” usually only around 10% of all cases are discussed in a consensus conference. Similarly, other international studies also found higher rates of consensus meetings for IBCs [29–31]. The radiologists’ retrospective assessments resulted in 68.9% IBCs without retrospective abnormalities and 31.1% potentially missed IBCs. Other studies come to similar shares of around 30% missed IBCs, but the specific proportion of IBCs retrospectively classified as missed significantly depends on the review methodology (e.g., blinded vs. fully informed review) [32, 33].

The distribution of case scores for all IBC cases indicates that the AI system identified abnormalities in a relevant portion of the IBC screening mammograms. In detail, the AI system assigned overall high case scores to potentially missed IBCs (51.4% with a case score above 50), where radiologists retrospectively also observed abnormalities. The high case scores for potentially missed IBCs highlight a high congruence where both the radiologists and the AI system detected abnormalities in the screening mammograms, even though these cases have not been identified in the MSP. In addition, the AI system recognized a fair share of IBCs without retrospective abnormalities (13.4% with a case score above 50), where no abnormalities were seen by the radiologists, indicating its complementary value. In practice, the AI system was recently introduced as a third reader in the MSP “donna,” with a case score threshold of 55, above which all cases are being discussed in a consensus conference (based on local historical data to discuss a maximum of 15% of all MSP cases in a consensus conference). When applying the case score threshold of 55 to our IBC population, 22.7% of the IBC cases get flagged by the AI system, and 51.6% of these cases were not discussed in a consensus conference during the previous screening. Thus, 11.7% of the IBC cases would now additionally be discussed in a consensus conference.

While there is only limited evidence so far, one retrospective study from Sweden concluded that their AI system (Transpara) might aid radiologists in detecting up to 19.3% of the IBCs at screening where the mammogram showed at least minimal signs of malignancy [6]. One study from Denmark tested different deployment scenarios of an AI system (Lunit) within an MSP and concluded that a triage-based approach provides the best performance metrics (e.g., accuracy, workload) when evaluating screen-detected and IBC screening mammograms [34]. Depending on the deployment scenario, between 13.3% (triage) and 19.5% (replacing one radiologist) of IBC cases would have been flagged by the respective AI system. Therefore, 25.2% of our IBCs with a case score above 50 are in line with these findings and at the upper end, primarily driven by potentially missed IBCs with overall much higher case scores (see Fig. 4).Fig. 4. Share of IBC cases above a certain case score threshold

The evaluation of the risk scores/-categories led to similar findings (see Table 3). Relative to a representative Swedish screening population, women with IBCs were categorized at much higher risk categories by the AI system (25.6% vs. 9% with high-risk) [12]. When comparing potentially missed IBCs with IBCs without retrospective abnormalities, the AI assigned significantly higher risk categories (51.4% vs. 14.6% within the high-risk category) to the potentially missed IBCs, aligning with the radiologists’ retrospective findings of abnormalities in these mammograms. One previous study identified that the same AI system as in our study and its risk forecasting component can predict around 30% of stage 2+ BCs in 6% of high-risk women, where no BC was detected during the MSP [12]. With 25.6% of all our IBCs being classified as high-risk, where later IBC was diagnosed across all stages, this figure is directionally in line with these findings. Another study retrospectively reviewed the screening mammograms by radiologists and a different AI system (Transpara) and found that there were 33% of IBCs that were classified as “true negative” and had been assigned the highest AI risk score [6]. Additionally, a study employed an AI system to analyze normal/negative MSP mammograms for BC risk. The ones with the highest risk underwent an additional MRI examination, resulting in a much higher efficiency in cancer detection compared to conventional density measurements and risk models [35].Table 3IBC cases per risk categoryGeneral distributiono/w consensus conferenceo/w further investigatedRisk categoryAll IBCs (n = 117)No retr. abn. (n = 82)Pot. missed (n = 35)All IBCs (n = 31)No retr. abn. (n = 8)Pot. missed (n = 23)All IBCs (n = 16)No retr. abn. (n = 1)Pot. missed (n = 15)High25.6%14.6%51.4%41.9%12.5%52.2%62.5%0.0%66.7%Moderate35.0%37.8%28.6%35.5%50.0%30.4%18.8%100%13.3%General39.3%47.6%20.0%22.6%37.5%17.4%18.8%0.0%20.0%“No retr. abn.”: IBCs without retrospective abnormalities; “Pot. missed”: potentially missed IBCs; “o/w consensus conference”: IBCs that have been discussed within the previous MSP consensus conference; “o/w further investigated”: IBCs that have been further investigated within the previous MSP

The joint examination of both AI information and the radiologist assessments outlines that there are many IBCs where neither the AI system nor the radiologists could find evidence for BC in the screening mammograms. However, it also highlights that there are many IBC screening mammograms where both AI systems’ outputs (moderate to high case and risk score) flagged abnormalities/indicated risk. Moreover, there are some IBCs with just a moderate to high case score (but low to moderate risk category) or just a moderate to high-risk category (but low to moderate risk case score). Thus, it is valuable to consider both AI system information.

Limitations

Our retrospective study design inherently limits our ability to draw causal conclusions and emphasizes the need for prospective studies. It is especially important to evaluate how the radiologists utilize the AI information in a consensus conference, where they do not see any abnormalities in the mammogram, and a decision needs to be made if/which further investigations (e.g., MRI, ultrasound, or contrast-enhanced mammography) should be conducted. Further, without the diagnostic mammogram or the detailed position of the diagnosed BC, we could not evaluate whether the AI-marked lesion corresponds to the later detected BC.

Future research needs to include a large representative sample of mammograms (i.e., screening-detected BC, IBC, normal mammograms) to substantiate the sensitivity and specificity of the AI system. This is important to define a practically relevant AI case-/risk score threshold, as only a fully representative sample can ensure that these thresholds are consistent with the overall optimization challenge (i.e., identification of the most possible BC cases at screening while avoiding excessive caseloads for consensus conferences/ultimately decreasing screening efficiency).

Conclusion

Our research highlights that an AI system can identify BC signs in relevant portions of IBC screening mammograms and thus potentially reduce the number of IBCs in an MSP that currently does not utilize an AI system. It can do so by identifying most IBC cases that are retrospectively visible for radiologists but have been missed during the previous screening (without AI support), thereby acting as a safety net for human error. Additionally, it can identify some IBCs that are not visible to humans (IBCs without retrospective abnormalities). The specific proportion of avoidable IBC cases depends on the selected case score and/or the respective risk classification threshold, by which a case is flagged. Overall, our study demonstrates the potential positive impact of a different AI system in a new population, Switzerland. Further research is required to prospectively assess how AI information is utilized to ultimately diagnose BC, especially when humans do not see any abnormalities.

Bibliography3

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Perry N, Broeders M, de Wolf C, Törnberg S, Holland R, von Karsa L (2008) European guidelines for quality assurance in breast cancer screening and diagnosis. Fourth edition—summary document. Ann Oncol 19:614–62210.1093/annonc/mdm 48118024988 · doi ↗ · pubmed ↗
2Lauritzen AD, Lillholm M, Lynge E, Nielsen M, Karssemeijer N, Vejborg I (2024) Early indicators of the impact of using AI in mammography screening for breast cancer. Radiology. 10.1148/radiol.23247910.1148/radiol.23247938832880 · doi ↗ · pubmed ↗
3Krebsliga Ostschweiz (2024) donna brochure https://www.donna-programm.ch/fileadmin/data/downloads/de-downloads/20220927_Donna_Broschuere_DE_Web.pdf