# Evaluation of large Language models on pediatric asthma: a comparative study of Claude3-Opus, Gemini 2.0, ChatGPT-4o, and DeepSeek—a cross-sectional questionnaire study

**Authors:** Ying-qi Hang, Jie Wu, Li Bai, Mingyun Wu, Jianer Yu, Liang Li, Xiang Piao

PMC · DOI: 10.1186/s12911-026-03371-x · BMC Medical Informatics and Decision Making · 2026-02-10

## TL;DR

This study compares how well four AI models provide information on pediatric asthma and finds that, while their answers are generally accurate, the reading level is too high for patient materials.

## Contribution

The study evaluates both the clinical quality and the readability of LLM-generated information specifically for pediatric asthma, and shows that outputs need simplification before patient-facing use.

## Key findings

- All four LLMs provided information of similar quality on pediatric asthma, with mean DISCERN scores of 50.3–51.9 (scale 16–80), in the 'fair-to-good' range.
- DeepSeek's output was significantly less readable than that of every other model; ChatGPT-4o was significantly more readable than DeepSeek (FRE mean difference = 12.41, p = .005).
- The reading grade level of every model's output (FKGL 13.2–14.9) exceeded the level recommended for patient materials, indicating a need for simplification.

## Abstract

Artificial intelligence (AI) has shown potential for enhancing medical practice and improving patient outcomes. However, the efficacy and linguistic accessibility of Large Language Models (LLMs) in pediatric asthma management remain underexplored. This study evaluated the performance of four LLMs in generating clinical information within this domain.

We administered 15 guideline-based pediatric asthma inquiries to ChatGPT-4o, Claude 3 Opus, Gemini 2.0, and DeepSeek. Anonymized responses were independently evaluated by three board-certified pediatric pulmonologists using the DISCERN instrument (score range 16–80). Readability was assessed using six standard indices. Inter-rater reliability was measured with intraclass correlation coefficients (ICC). Statistical analysis included repeated-measures analysis and post-hoc comparisons with effect size reporting.
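The abstract does not list all six readability indices, but its results name two of them: Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL). As a minimal sketch of how these two scores are defined, the Python below applies the standard published formulas. The syllable counter is a crude vowel-group heuristic and the helper names (`count_syllables`, `text_stats`) are illustrative assumptions, not the tooling used in the study, so scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups (an assumption, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentence count, word count, syllable count) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """FRE: higher scores mean easier text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(text: str) -> float:
    """FKGL: approximate US school grade needed to read the text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


if __name__ == "__main__":
    # Hypothetical sentences written for this sketch, not taken from the study.
    plain = "Use the inhaler every day."
    dense = ("Administer inhaled corticosteroids according to the "
             "prescribed maintenance regimen.")
    for label, sample in (("plain", plain), ("dense", dense)):
        print(label, round(flesch_reading_ease(sample), 1),
              round(flesch_kincaid_grade(sample), 1))
```

Both formulas depend only on average sentence length and average syllables per word, so longer sentences and polysyllabic clinical terms lower FRE and raise FKGL.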

No significant difference was found in the overall quality of health information (DISCERN scores) among the four LLMs (F(3,56) = 0.144, p = .933, η² = 0.008), with all mean scores clustered within a narrow “fair-to-good” range (50.3–51.9). However, significant differences were observed in readability: ChatGPT-4o generated significantly more comprehensible text than DeepSeek (FRE mean difference = 12.41, p = .005, Cohen’s d = 1.28), while DeepSeek performed significantly worse than all other models (all p < .05). Inter-rater reliability was high (ICC range: 0.849–0.901, all p < .001). Critically, the mean reading grade level of all outputs (FKGL: 13.2–14.9) far exceeded the reading level recommended for patient materials.
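The reported effect sizes can be sanity-checked from the other statistics in the same paragraph. The sketch below assumes a one-way layout, where η² = F·df₁ / (F·df₁ + df₂), and assumes Cohen's d was computed as the mean difference divided by a pooled standard deviation; the abstract does not state how the authors computed either value. The degrees of freedom (3, 56) are at least consistent with 4 models × 15 questions = 60 responses. The function names are illustrative.

```python
def eta_squared_from_f(f_stat: float, df_effect: int, df_error: int) -> float:
    """Recover eta squared from an F statistic, assuming a one-way design."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


def pooled_sd_from_d(mean_diff: float, cohens_d: float) -> float:
    """Standard deviation implied by d = mean_diff / pooled_sd (assumed definition)."""
    return mean_diff / cohens_d


if __name__ == "__main__":
    # Values taken from the abstract: F(3,56) = 0.144; FRE difference 12.41, d = 1.28.
    print(round(eta_squared_from_f(0.144, 3, 56), 3))
    print(round(pooled_sd_from_d(12.41, 1.28), 2))
```

Under these assumptions the F statistic reproduces the reported η² to three decimals, and the ChatGPT-4o versus DeepSeek comparison implies a pooled standard deviation of roughly 9.7 FRE points.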

While current LLMs can provide generally accurate information on pediatric asthma, their outputs exhibit significant limitations in readability for patient-facing use. ChatGPT-4o shows relative advantages in comprehensibility, yet no model meets recommended health-literacy standards. These findings underscore that AI should serve as a supplementary decision-support tool under clinician supervision, not as a substitute for professional medical advice. Future work should prioritize the integration of adaptive text-simplification features, validate AI-generated content in real-world clinical and caregiver settings, and expand evaluations to include emerging models and diverse chronic disease contexts.

The online version contains supplementary material available at 10.1186/s12911-026-03371-x.

## Linked entities

- **Diseases:** pediatric asthma (MONDO:0005405)

## Full-text entities

- **Diseases:** asthma (MESH:D001249)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12990414/full.md

## References

3 references — full list in the complete paper: https://tomesphere.com/paper/PMC12990414/full.md

---
Source: https://tomesphere.com/paper/PMC12990414