# Large Language Models Perform at Chance Level in the Diagnosis of Pediatric Pneumonia Using Chest Radiographs

**Authors:** Justin Gillette, Michelle Lu, Thomas F Heston

PMC · DOI: 10.7759/cureus.92596 · Cureus · 2025-09-17

## TL;DR

This study finds that general-purpose LLMs such as ChatGPT and Gemini perform at chance level when diagnosing pediatric pneumonia from chest X-rays and are not reliable for this task.

## Contribution

The study is the first to evaluate the diagnostic performance of general-purpose large language models on pediatric chest radiographs for pneumonia.

## Key findings

- LLMs performed at chance level (31% average accuracy, against roughly 33% expected from guessing in a three-choice task) in classifying pediatric chest X-rays for pneumonia.
- Accuracy was highest for viral pneumonia (54%) and lowest for normal X-rays (18%).
- LLMs showed poor internal consistency and low agreement with human experts.

## Abstract

Introduction

Pneumonia remains a significant cause of morbidity and mortality in children globally. Chest radiographs (CXRs) are widely used to diagnose pediatric pneumonia; however, distinguishing between bacterial and viral etiologies on imaging is a diagnostically challenging task. Large language models (LLMs), particularly those integrated with vision capabilities, have shown promise in preliminary studies for interpreting CXR findings. However, the diagnostic performance of general-purpose LLMs without specialized medical training or add-ons remains poorly understood. This study examined whether such LLMs could independently and reliably distinguish between bacterial, viral, and normal CXRs in pediatric patients.

Methods

We evaluated four publicly available LLMs (ChatGPT o3, Claude 3.7 Sonnet, Gemini 2.5 Pro, and Grok 3) on a dataset of 44 pediatric CXRs confirmed by human readers to show bacterial pneumonia (n = 17), viral pneumonia (n = 13), or no abnormality (n = 14). Each image was analyzed twice by each LLM using a standardized prompt, resulting in a total of eight readings per image. Diagnostic accuracy was assessed relative to human expert consensus. Internal consistency was measured by comparing repeated interpretations. A prespecified adaptive stopping rule was employed based on performance futility criteria. Sample size calculations and statistical analyses were conducted using G*Power.
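The design arithmetic and the two metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis (which used G*Power): the function names `accuracy` and `internal_consistency` are hypothetical, and the assumption that internal consistency is the fraction of images receiving the same label on both passes is ours, since the abstract does not give the exact definition.

```python
# Class counts, model count, and repeat count are taken from the Methods text.
N_IMAGES = {"bacterial": 17, "viral": 13, "normal": 14}
N_MODELS = 4          # ChatGPT o3, Claude 3.7 Sonnet, Gemini 2.5 Pro, Grok 3
READS_PER_MODEL = 2   # each image analyzed twice by each model

total_images = sum(N_IMAGES.values())             # 17 + 13 + 14
readings_per_image = N_MODELS * READS_PER_MODEL   # 4 * 2
total_readings = total_images * readings_per_image


def accuracy(predictions, truth):
    """Fraction of readings that match the human expert consensus label."""
    if len(predictions) != len(truth):
        raise ValueError("predictions and truth must have the same length")
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def internal_consistency(first_pass, second_pass):
    """Fraction of images given the same label on both passes (assumed definition)."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same images")
    return sum(a == b for a, b in zip(first_pass, second_pass)) / len(first_pass)


if __name__ == "__main__":
    print(total_images, readings_per_image, total_readings)
```

Under this reading, a model that changes its answer on half of the images between passes would score 0.5 on `internal_consistency` regardless of whether either answer was correct, which is why consistency and accuracy are reported separately.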

Results

Across all models and CXR types, the average diagnostic accuracy was 31%, consistent with chance-level performance in a three-choice classification task. Accuracy was highest for viral pneumonia (54%) and lowest for normal CXRs (18%). Internal consistency ranged from 46% to 71% across models, indicating unreliable performance. Concordance with human expert interpretation did not exceed 49% for any of the models. Futility criteria were met after 44 cases, prompting early termination of data collection.
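To see why 31% counts as chance level, compare it with the 1/3 baseline expected from random guessing among three classes. The sketch below computes a Wilson score interval for an observed proportion. It rests on a loud assumption: that the 31% figure is pooled over all 44 × 8 = 352 readings, giving roughly 109 correct readings. The abstract reports only an average, so this count is illustrative and is not a result from the paper. The function name `wilson_interval` is ours.

```python
import math

CHANCE = 1 / 3  # expected accuracy from uniform guessing among three classes


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # ASSUMPTION: pooled readings and a rounded correct count, for illustration only.
    n_readings = 44 * 8
    n_correct = round(0.31 * n_readings)
    low, high = wilson_interval(n_correct, n_readings)
    print(f"interval: ({low:.3f}, {high:.3f}); contains chance: {low < CHANCE < high}")
```

Under that assumption the interval spans the 1/3 baseline, which matches the abstract's description of the result as consistent with chance.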

Conclusion

General-purpose LLMs currently available to the public are not reliable diagnostic tools for pediatric pneumonia on chest radiographs. Their accuracy is low, particularly in ruling out disease, and their responses lack internal consistency. These findings highlight the risks associated with deploying such models in unsupervised clinical or consumer-facing settings. Future research should focus on purpose-built radiologic AI tools trained on diverse, clinically representative datasets and integrated with clinician oversight to ensure their safe and effective use.

## Linked entities

- **Diseases:** pneumonia (MONDO:0005249), bacterial pneumonia (MONDO:0004652), viral pneumonia (MONDO:0006012)

## Full-text entities

- **Diseases:** Pneumonia (MESH:D011014), bacterial pneumonia (MESH:D018410)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12536854/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12536854/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/PMC12536854/full.md

---
Source: https://tomesphere.com/paper/PMC12536854