Interobserver variability of recall decisions between mammography readers in the English NHS breast screening programme: A comparison of interobserver variability measures

Laura Quinn; David Jenkinson; Sian Taylor-Phillips; Yemisi Takwoingi; Alice Sitch

PMC · DOI: 10.1016/j.ejrad.2026.112723 · April 1, 2026

Open Access

TL;DR

This study compares how consistently mammogram readers decide to recall women for further testing in breast cancer screening, and shows that Cohen’s kappa, a commonly used agreement measure, is distorted when recalls are rare.

Contribution

The paper evaluates and compares different measures of interobserver variability in mammography recall decisions, emphasizing the limitations of Cohen’s kappa in low-prevalence settings.

Findings

01. Percentage agreement, Gwet’s AC, and PABAK showed lower agreement at first screening appointments than at subsequent ones.

02. Cohen’s kappa was heavily distorted by outcome prevalence, making it unsuitable for low-prevalence screening settings.

03. Gwet’s AC and PABAK were more informative for assessing variability in challenging screening scenarios.

Abstract

To evaluate interobserver variability between mammogram readers’ recall decisions in the English NHS breast screening programme, comparing different variability measures. Data from 401,682 women in 22 NHS centres who underwent mammographic screening, interpreted independently by two mammogram readers, were included. Percentage agreement, prevalence-adjusted bias-adjusted kappa (PABAK), Gwet’s agreement coefficient (Gwet’s AC) and Cohen’s kappa were reported with 95% confidence intervals. Analyses were performed separately for women at first and subsequent screening appointments, and by cancer diagnosis, reader recall rates and age group. Of 86,287 women at first screening, 6,491 (7.5%) were recalled, compared to 9,488 (3.0%) of 315,395 at subsequent screenings. Percentage agreement, Gwet’s AC, and PABAK were lower for first screening than subsequent (93.6%, 95% CI: 93.4–93.7 vs 97.2%, 95% CI:…
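The four measures named in the abstract can be illustrated with a minimal sketch for two readers making a binary recall decision. It uses the standard textbook definitions and assumes that "Gwet’s AC" refers to the first-order coefficient AC1; the `RecallTable` class, the function names, and both example tables are hypothetical and are not taken from the paper or its data. Confidence intervals are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecallTable:
    """2x2 table of paired recall decisions (hypothetical helper)."""
    both_recall: int   # both readers recall
    only_reader1: int  # reader 1 recalls, reader 2 does not
    only_reader2: int  # reader 2 recalls, reader 1 does not
    neither: int       # neither reader recalls

    @property
    def n(self) -> int:
        return (self.both_recall + self.only_reader1
                + self.only_reader2 + self.neither)

    @property
    def rate1(self) -> float:
        """Recall rate of reader 1."""
        return (self.both_recall + self.only_reader1) / self.n

    @property
    def rate2(self) -> float:
        """Recall rate of reader 2."""
        return (self.both_recall + self.only_reader2) / self.n


def percentage_agreement(t: RecallTable) -> float:
    """Observed proportion of cases where both readers agree."""
    return (t.both_recall + t.neither) / t.n


def cohen_kappa(t: RecallTable) -> float:
    """Chance agreement is built from each reader's own marginal rates."""
    p_o = percentage_agreement(t)
    p_e = t.rate1 * t.rate2 + (1 - t.rate1) * (1 - t.rate2)
    return (p_o - p_e) / (1 - p_e)


def pabak(t: RecallTable) -> float:
    """For two categories, chance agreement is fixed at 0.5."""
    return 2 * percentage_agreement(t) - 1


def gwet_ac1(t: RecallTable) -> float:
    """Assumed AC1: chance agreement uses the mean recall rate."""
    p_o = percentage_agreement(t)
    pi = (t.rate1 + t.rate2) / 2
    p_e = 2 * pi * (1 - pi)
    return (p_o - p_e) / (1 - p_e)


# Two made-up tables with identical 97% observed agreement.
low = RecallTable(both_recall=10, only_reader1=15, only_reader2=15, neither=960)
balanced = RecallTable(both_recall=485, only_reader1=15, only_reader2=15, neither=485)

for name, table in (("low prevalence", low), ("balanced", balanced)):
    print(
        f"{name}: agreement={percentage_agreement(table):.3f} "
        f"kappa={cohen_kappa(table):.3f} "
        f"PABAK={pabak(table):.3f} AC1={gwet_ac1(table):.3f}"
    )
```

In this made-up example both tables have the same observed agreement, yet Cohen’s kappa is about 0.38 when roughly 2.5% of cases are recalled and 0.94 when recalls are balanced. PABAK stays at 0.94 in both tables, and AC1 is about 0.97 in the low-prevalence table. This is the prevalence effect the findings describe.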

Linked entities

Species (1): Homo sapiens (human)

Diseases (2): breast cancer; cancer

Taxonomy

Topics: Reliability and Agreement in Measurement · Global Cancer Incidence and Screening · Radiology practices and education