# Detecting Laterality Errors in Combined Radiographic Studies by Enhancing the Traditional Approach With GPT-4o: Algorithm Development and Multisite Internal Validation

**Authors:** Kung-Hsun Weng, Yi-Chen Chou, Yu-Ting Kuo, Tsyh-Jyi Hsieh, Chung-Feng Liu

PMC · DOI: 10.2196/76384 · JMIR Formative Research · 2025-10-29

## TL;DR

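This paper introduces a method that combines a rule-based system with GPT-4o to detect laterality errors in combined radiographic reports. On real-world imbalanced data it outperformed GPT-4o alone, ClinicalBERT, and RoBERTa, and every model scored a lower F1 on real-world data than on the synthetic balanced dataset.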

## Contribution

The novel contribution is a clinically deployable ensemble method (rule-based + GPT-4o) for detecting laterality errors in combined radiographic reports using real-world imbalanced data.

## Key findings

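- On the real-world imbalanced Test-1 data, the rule-based+GPT-4o method achieved the highest F1-score (0.780), ahead of GPT-4o alone (0.352), ClinicalBERT (0.088), and RoBERTa (0.000).
- In real-world data, the laterality error rate was higher in combined reports (103/7000, 1.47%) than in noncombined reports (17/3000, 0.57%; P<.001).
- Performance dropped from the synthetic balanced dataset (Test-2) to real-world imbalanced data (Test-1), driven largely by lower precision for GPT-4o (0.979 to 0.219) and ClinicalBERT (0.984 to 0.047).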

## Abstract

Laterality errors in radiology reports can endanger patient safety. Effective methods for screening for laterality errors in combined radiographic reports, which combine multiple studies into one, remain unexplored.

First, we define and analyze the unstudied combined radiographic report format and its challenges. Second, we introduce a clinically deployable ensemble method (rule-based+GPT-4o), evaluated on large-scale, real-world, imbalanced data. Third, we demonstrate significant performance gaps between real-world imbalanced and synthetic balanced datasets, highlighting limitations of the benchmarking methodology commonly used in current studies.

This retrospective study analyzed deidentified English radiology reports containing laterality terms in order. We split the data into three sets: TrainVal (a combined training and validation dataset) and Test-1, both real-world and imbalanced, and Test-2, which is synthetic and balanced. Test-1 came from a distinct branch. Experiment 1 compared the baseline, workaround, and GPT-4o-augmented rule-based methods. Experiment 2 compared the rule-based method with the highest recall to fine-tuned RoBERTa, ClinicalBERT, and GPT-4o models.

As of July 2024, our dataset included 10,000 real-world and 889 synthetic radiology reports. The laterality error rate in real-world reports was 1.20% (120/10,000), significantly higher in combined (103/7000, 1.47%) than in noncombined reports (17/3000, 0.57%; difference=0.90%; z=3.81; P<.001).

In experiment 1, recall differed significantly among the 3 versions of rule-based methods (Q=6.0; P=.0498, Friedman test). The rule-based+GPT-4o method had the highest recall (average rank=1), significantly better than the baseline (average rank=3; P=.04, Nemenyi test). Most (5/6) of the false positives introduced by the GPT-4o information extraction were due to parser limitations hidden by error cancellation.

In experiment 2, on Test-1, rule-based+GPT-4o (precision=0.696; recall=0.889; F1-score=0.780) outperformed GPT-4o (precision=0.219; recall=0.889; F1-score=0.352), ClinicalBERT (precision=0.047; recall=0.667; F1-score=0.088), and RoBERTa (F1-score=0.000). On Test-2, rule-based+GPT-4o (precision=0.996; recall=0.925; F1-score=0.959) and GPT-4o (precision=0.979; recall=0.953; F1-score=0.966) outperformed ClinicalBERT (precision=0.984; recall=0.749; F1-score=0.851) and RoBERTa (F1-score=0.013). Both ClinicalBERT and GPT-4o exhibited notable declines in precision on TrainVal and Test-1 compared to Test-2.

Both Test-1 data membership (GPT-4o: odds ratio [OR] 239.89, 95% CI 111.05-518.01; P<.001; ClinicalBERT: OR 1924.07, 95% CI 687.46-5383.99; P<.001) and order count per study (GPT-4o: OR 1.79, 95% CI 1.38-2.31; P<.001; ClinicalBERT: OR 2.50, 95% CI 1.64-3.80; P<.001) independently predicted false positive errors in multivariate logistic regression. In subgroup analysis, all models showed reduced precision and F1 in combined-study subgroups.
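The reported error-rate comparison and F1-scores can be sanity-checked from the counts and rounded metrics given above. The sketch below assumes a pooled two-proportion z-test (the abstract does not name the exact variant); the function names `two_proportion_z` and `f1_score` are illustrative, not taken from the paper.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic and two-sided p-value (assumed test variant)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal variable
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Combined reports: 103 errors in 7000; noncombined reports: 17 errors in 3000
z, p = two_proportion_z(103, 7000, 17, 3000)
print(f"z = {z:.2f}, two-sided p < .001: {p < 0.001}")

# Rule-based+GPT-4o on Test-1, recomputed from the rounded precision and recall
print(f"F1 = {f1_score(0.696, 0.889):.3f}")
```

Under this assumption the z statistic agrees with the reported 3.81. F1 values recomputed from three-decimal precision and recall can differ from the published figures in the last digit, because the paper presumably computed them from raw counts.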

The combined radiographic report format poses distinct challenges for both radiology report quality assurance and natural language processing. The combined rule-based and GPT-4o method effectively screens for laterality errors in imbalanced real-world reports. A significant performance gap exists between balanced synthetic datasets and imbalanced real-world data. Future studies should also include real-world imbalanced data.
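One reason a balanced benchmark can overstate precision is error prevalence alone. The following sketch applies Bayes' rule to a detector with fixed sensitivity and specificity. The values 0.95 and 0.98 are hypothetical and not taken from the paper, and the 50% prevalence for a balanced set is an assumption. Only the 1.2% prevalence comes from the reported real-world error rate.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision (positive predictive value) at a given error prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical detector characteristics, chosen only for illustration
sens, spec = 0.95, 0.98

balanced = precision_at_prevalence(sens, spec, 0.50)     # assumed balanced benchmark
imbalanced = precision_at_prevalence(sens, spec, 0.012)  # 1.2% real-world error rate

print(f"precision at 50% prevalence:  {balanced:.3f}")
print(f"precision at 1.2% prevalence: {imbalanced:.3f}")
```

This isolates the prevalence effect only. It does not model differences in report content between sites or the number of orders per study, both of which the regression results identify as independent predictors of false positives.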

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12612642/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12612642/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12612642/full.md

---
Source: https://tomesphere.com/paper/PMC12612642