# Clinicians’ Agreement on Extrapulmonary Radiographic Findings in Chest X-Rays Using a Diagnostic Labelling Scheme

**Authors:** Lea Marie Pehrson, Dana Li, Alyas Mayar, Marco Fraccaro, Rasmus Bonnevie, Peter Jagd Sørensen, Alexander Malcom Rykkje, Tobias Thostrup Andersen, Henrik Steglich-Arnholm, Dorte Marianne Rohde Stærk, Lotte Borgwardt, Sune Darkner, Jonathan Frederik Carlsen, Michael Bachmann Nielsen, Silvia Ingala

PMC · DOI: 10.3390/diagnostics15070902 · Diagnostics · 2025-04-01

## TL;DR

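Six clinicians at three experience levels annotated extrapulmonary findings in 100 chest X-rays with a pre-defined diagnostic labelling scheme. Overall agreement was high at every level (PABAK 0.86 to 0.91), although agreement on positive findings was only moderate to low.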

## Contribution

The study evaluates inter-reader and intra-reader agreement for extrapulmonary findings in chest X-rays annotated with a pre-defined diagnostic labelling scheme by clinicians of varying experience.

## Key findings

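- Grouped-label agreement was high at every experience level (PABAK: novice 0.86, intermediate 0.90, experienced 0.91).
- Agreement on negative findings was strong (PNA), while agreement on positive findings was moderate to low (PPA).
- Specific agreement differed significantly between novice and experienced clinicians for eight labels, but RK did not vary significantly across experience levels.
- McNemar's test indicated that annotations were stable between the two rounds, which were separated by a three-week washout period.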
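
To make the agreement statistics behind these findings concrete, the sketch below computes PABAK together with the proportions of positive and negative agreement (PPA, PNA) for one binary label and two readers. It uses the standard two-rater definitions and is not the authors' code; the function name `agreement_metrics` and the toy ratings are assumptions made for illustration.

```python
from typing import Dict, Sequence


def agreement_metrics(r1: Sequence[int], r2: Sequence[int]) -> Dict[str, float]:
    """Two-reader agreement on one binary label (1 = finding present, 0 = absent)."""
    if len(r1) != len(r2) or len(r1) == 0:
        raise ValueError("need two equal-length, non-empty rating sequences")

    n = len(r1)
    both_pos = sum(x == 1 and y == 1 for x, y in zip(r1, r2))
    both_neg = sum(x == 0 and y == 0 for x, y in zip(r1, r2))
    discordant = n - both_pos - both_neg

    observed = (both_pos + both_neg) / n          # raw proportion of agreement
    pabak = 2 * observed - 1                      # PABAK for two categories

    # Specific agreement: share of positive (negative) calls that both readers made.
    pos_den = 2 * both_pos + discordant
    neg_den = 2 * both_neg + discordant
    ppa = 2 * both_pos / pos_den if pos_den else float("nan")
    pna = 2 * both_neg / neg_den if neg_den else float("nan")

    return {"observed": observed, "pabak": pabak, "ppa": ppa, "pna": pna}


# Toy data (invented): 100 images, a rare finding, 4 disagreements.
reader_a = [1] * 2 + [1] * 3 + [0] * 1 + [0] * 94
reader_b = [1] * 2 + [0] * 3 + [1] * 1 + [0] * 94
metrics = agreement_metrics(reader_a, reader_b)
print({name: round(value, 3) for name, value in metrics.items()})
```

In this toy case the many shared negatives give a PABAK of 0.92 and a PNA of about 0.98, while PPA is only 0.5. That illustrates how a rare finding can combine high overall agreement with weak agreement on positive calls.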

## Abstract

**Objective:** Reliable reading and annotation of chest X-ray (CXR) images are essential for both clinical decision-making and AI model development. While most of the literature emphasizes pulmonary findings, this study evaluates the consistency and reliability of annotations for extrapulmonary findings made with a diagnostic labelling scheme.

**Methods:** Six clinicians with varying experience levels (novice, intermediate, and experienced) annotated 100 CXR images using a diagnostic labelling scheme in two rounds separated by a three-week washout period. Annotation consistency was assessed using Randolph’s free-marginal kappa (RK), prevalence- and bias-adjusted kappa (PABAK), proportion positive agreement (PPA), and proportion negative agreement (PNA). Pairwise comparisons and McNemar’s test were conducted to assess inter-reader and intra-reader agreement.

**Results:** PABAK values indicated high overall grouped labelling agreement (novice: 0.86, intermediate: 0.90, experienced: 0.91). PNA values demonstrated strong agreement on negative findings, while PPA values showed moderate-to-low consistency in positive findings. Significant differences in specific agreement emerged between novice and experienced clinicians for eight labels, but there were no significant variations in RK across experience levels. McNemar’s test confirmed annotation stability between rounds.

**Conclusions:** This study demonstrates that clinician annotations of extrapulmonary findings in CXR are consistent and reliable across different experience levels when a pre-defined diagnostic labelling scheme is used. These insights aid in optimizing training strategies for both clinicians and AI models.
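
The two remaining statistics named in the abstract can be sketched in the same spirit. The block below implements Randolph's free-marginal multirater kappa and an exact two-sided McNemar test on paired binary annotations. Both follow the textbook definitions; the function names and the choice of the exact binomial form of McNemar's test are assumptions, since this summary does not state which variant the authors used.

```python
from math import comb
from typing import Sequence


def randolph_kappa(ratings: Sequence[Sequence[int]], n_categories: int = 2) -> float:
    """Free-marginal multirater kappa.

    ratings[i] holds the category each reader assigned to item i;
    every item must be rated by the same number of readers (at least two).
    """
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two readers per item")

    agreeing_pairs = 0
    for item in ratings:
        if len(item) != n_raters:
            raise ValueError("every item needs the same number of readers")
        # Ordered pairs of readers who chose the same category for this item.
        agreeing_pairs += sum(item.count(c) * (item.count(c) - 1) for c in set(item))

    observed = agreeing_pairs / (len(ratings) * n_raters * (n_raters - 1))
    expected = 1 / n_categories  # chance agreement with free marginals
    return (observed - expected) / (1 - expected)


def mcnemar_exact(round1: Sequence[int], round2: Sequence[int]) -> float:
    """Exact two-sided McNemar p-value for paired binary annotations."""
    if len(round1) != len(round2):
        raise ValueError("rounds must cover the same images")
    b = sum(x == 1 and y == 0 for x, y in zip(round1, round2))  # positive, then negative
    c = sum(x == 0 and y == 1 for x, y in zip(round1, round2))  # negative, then positive
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a shift
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For two readers and two categories, `randolph_kappa` reduces to `2 * observed - 1`, the same value as PABAK. A large McNemar p-value means the discordant changes between rounds are balanced, which is the sense in which annotations are described as stable.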

## Full-text entities

- **Diseases:** TB (MESH:D014376), cardiomegaly (MESH:D006332), injury to (MESH:D014947), pain (MESH:D010146), fractures (MESH:D050723), lung pathologies (MESH:D008171), lung cancer (MESH:D008175), emphysema (MESH:D004646), bone (MESH:D001847)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11988848/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11988848/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/PMC11988848/full.md

---
Source: https://tomesphere.com/paper/PMC11988848