Evaluation of large Language models on pediatric asthma: a comparative study of Claude3-Opus, Gemini 2.0, ChatGPT-4o, and DeepSeek—a cross-sectional questionnaire study

Ying-qi Hang; Jie Wu; Li Bai; Mingyun Wu; Jianer Yu; Liang Li; Xiang Piao

PMC · DOI: 10.1186/s12911-026-03371-x · February 10, 2026

Open Access

TL;DR

This study compares how well four AI models provide information on pediatric asthma, finding that while their answers are of comparable quality, the reading level of the text is too high for patients.

Contribution

The study evaluates the readability and clinical quality of AI-generated information specifically for pediatric asthma, highlighting the need for better accessibility.

Findings

01

All four LLMs provided similar quality information on pediatric asthma, scoring in the 'fair-to-good' range.

02

ChatGPT-4o generated significantly more readable content than DeepSeek, which performed worse than all others.

03

The reading level of every model's output exceeded the recommended grade level for patient materials, indicating a need for simplification.

Abstract

Artificial intelligence (AI) has shown potential for enhancing medical practice and improving patient outcomes. However, the efficacy and linguistic accessibility of large language models (LLMs) in pediatric asthma management remain underexplored. This study evaluated the performance of four LLMs in generating clinical information within this domain. We administered 15 guideline-based pediatric asthma inquiries to ChatGPT-4o, Claude 3 Opus, Gemini 2.0, and DeepSeek. Anonymized responses were independently evaluated by three board-certified pediatric pulmonologists using the DISCERN instrument (score range 16–80). Readability was assessed using six standard indices. Inter-rater reliability was measured with intraclass correlation coefficients (ICC). Statistical analysis included repeated measures and post-hoc comparisons with effect size reporting. No significant difference was found in…

Funding

- The China Association of Traditional Chinese Medicine (CATM) Innovative Development Program for Young Pediatricians
- The National Natural Science Foundation of China
- The National Natural Science Foundation of China
- The Natural Science Foundation of Shanghai

Keywords

Artificial intelligence · ChatGPT · Claude · Gemini · DeepSeek · Pediatric asthma

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Text Readability and Simplification · Health Literacy and Information Accessibility

Full text

Introduction

Artificial intelligence (AI), particularly large language models (LLMs) driven by natural language processing (NLP), holds potential for respiratory medicine, offering capabilities ranging from clinical decision support [1] to personalized patient education [2–3]. In adult respiratory care, NLP tools have demonstrated utility in interpreting asthma control metrics and synthesizing guideline-based recommendations, thereby optimizing clinical workflows [4–6]. However, the translation of these technologies into pediatric asthma management presents unique challenges [7]. Unlike adult care, pediatric management hinges on the health literacy of caregivers, necessitating continuous, clear communication to ensure treatment adherence and effective self-management [8].

Despite the proliferation of LLMs such as ChatGPT, Gemini, Claude, and DeepSeek, their practical utility in this high-stakes context remains constrained by a critical dual requirement [9]: factual precision and linguistic accessibility. While prior studies have scrutinized the accuracy of LLM-generated medical content [10], two significant gaps persist. First, there is a paucity of systematic evaluations using validated quality assessment instruments such as DISCERN [11]. Second, and perhaps more critically, the readability of these outputs, a key determinant of their safety and effectiveness for non-expert users, has been largely overlooked [12]. It remains unclear [13] whether current algorithms can bridge the gap between complex clinical guidelines and the lay language required for effective caregiver education [14–15].

This study aims to systematically evaluate the performance of four LLMs (ChatGPT-4o, Gemini 2.0, Claude 3 Opus, and DeepSeek) in addressing pediatric asthma management [16]. By employing validated instruments to assess informational quality (DISCERN) alongside multiple readability indices, we will analyze the responses to a standardized set of guideline-based clinical questions. The primary objective is to determine whether these models can generate outputs that are simultaneously clinically reliable and linguistically accessible to clinicians and patients/caregivers. This comparative analysis seeks to inform the safe and practical integration of LLMs into asthma care, particularly for enhancing patient education and supporting shared clinical decision-making.

Methods

Study design and model description

This cross-sectional comparative study evaluated the performance of four publicly available large language models (LLMs): ChatGPT-4o (OpenAI), Claude 3 Opus (Anthropic), Gemini 2.0 (Google DeepMind), and DeepSeek (DeepSeek AI). All queries were submitted through the models’ standard web interfaces. To ensure response independence and minimize contextual carryover, each question was posed within a newly initiated chat thread. To account for potential variability in input generation, all queries were submitted independently by three study investigators.

Evaluators (participants)

Response quality was evaluated by an independent panel of three board-certified pediatric pulmonologists. Each expert possessed over ten years of clinical experience in pediatric respiratory practice and was blinded to the source (e.g., which AI model generated which response) during the scoring process. These specialists were not involved in the earlier stage of question development, thereby ensuring an unbiased assessment of response quality. Their primary role was to apply the DISCERN instrument to evaluate the accuracy, reliability, and comprehensiveness of each AI-generated answer.

Development of the clinical question set

The study utilized 15 clinical questions (Supplementary file 1) derived from the 2024 Global Initiative for Asthma (GINA) Pediatric Guideline [16]. The questions were stratified into two categories to reflect distinct end-user needs:

Caregiver/Patient-focused (n = 7): Targeted disease mechanisms, daily management, medication safety, and trigger avoidance (Questions 3, 4, 5, 9, 10, 11, 12).

Clinician-focused (n = 8): Centered on guideline-based treatment strategies, step-up/step-down therapy, and exacerbation management (Questions 1, 2, 6, 7, 8, 13, 14, 15).

Question validity was established through a three-phase process: (1) domain extraction from GINA chapters; (2) question drafting; and (3) expert review and finalization.

Domain Extraction: Key clinical topics were systematically mapped from core GINA chapters, including diagnosis, pharmacotherapy, exacerbation management, and prevention.

Question Drafting: Preliminary questions were formulated to reflect common clinical scenarios and typical patient/caregiver concerns within these domains.

Expert Review and Finalization: The independent pediatric specialists reviewed all drafted questions for clinical relevance, clarity, and the availability of a definitive evidence-based answer within the GINA framework. Final inclusion required consensus, resulting in the 15-question set.

Outcome measures

Primary outcome: information quality (DISCERN)

The quality of each response was assessed using the DISCERN instrument [17], a validated 16-item tool in which each item is scored from 1 (low quality) to 5 (high quality). Total scores range from 16 to 80 and are categorized as Very Poor (16–26), Poor (27–38), Fair (39–50), Good (51–62), or Excellent (63–80). Three independent pediatric specialists, blinded to the AI model source, scored all responses. The mean score from the three raters for each question-model pair was used in the primary analysis.
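The scoring rule above can be sketched in a few lines of Python. This is an illustrative helper rather than the authors' procedure: the function names are hypothetical, and rounding a fractional mean to the nearest integer before banding is an assumption, since the text does not say how non-integer means were assigned to categories.

```python
from statistics import mean

# Category bands for the total DISCERN score (16-80), as listed in the text.
DISCERN_BANDS = [
    (16, 26, "Very Poor"),
    (27, 38, "Poor"),
    (39, 50, "Fair"),
    (51, 62, "Good"),
    (63, 80, "Excellent"),
]


def discern_category(total_score: float) -> str:
    """Map a total DISCERN score to its band.

    Assumption: a fractional mean is rounded to the nearest integer first;
    the source does not state how such values were binned.
    """
    rounded = round(total_score)
    for low, high, label in DISCERN_BANDS:
        if low <= rounded <= high:
            return label
    raise ValueError(f"DISCERN total out of range: {total_score}")


def rate_response(rater_totals: list[int]) -> tuple[float, str]:
    """Average the raters' totals for one question-model pair and band the mean."""
    average = mean(rater_totals)
    return average, discern_category(average)
```

For example, `rate_response([58, 53, 62])` averages the three specialists' totals for one question and returns the mean together with its band.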

Secondary outcome: text readability

Readability was assessed using six standardized indices calculated via the WebFX Readability Tool [18]: Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), Coleman-Liau Index (CLI), and Automated Readability Index (ARI). Following American Medical Association (AMA) and National Institutes of Health (NIH) recommendations, a grade level ≤ 8 (or FRE ≥ 60) was considered appropriate for patient-facing materials [19] (Table 1).

Table 1. Readability indices and their measurement frameworks

| Readability Indices | Measurement regulations | Reading Level Interpretation |
| --- | --- | --- |
| Flesch Kincaid Reading Ease (FRE) (b) | Words per sentence and syllables per word | 0–30: Very difficult; 31–60: Difficult; 61–70: Moderately easy; 71–100: Easy; Target for patient materials: ≥60 |
| Flesch Kincaid Grade Level (FKGL) | Length of each sentence and syllables per sentence | US grade levels (a) (5th grade to college graduate) |
| Gunning Fog Index (Gunning FOG) | Length and complexity of each sentence | US grade levels (a) (5th grade to college graduate) |
| SMOG Index (SMOG) | Complex word density of sentence | US grade levels (a) (5th grade to college graduate) |
| Coleman Liau Index (CLI) | Number of characters and words per sentence | US grade levels (a) (5th grade to college graduate) |
| Automated Readability Index (ARI) | Number of characters and length of each sentence | US grade levels (a) (5th grade to college graduate) |

(a) Each index uses grade level as a unit of results.
(b) The Flesch Kincaid Reading Ease utilizes a different scoring system, in which case the scale is from 0
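For readers who want to see how the six indices are built, the sketch below implements the standard published formulas from raw text counts. It is not the WebFX tool: the function names are hypothetical, the counts (words, sentences, syllables, and so on) are assumed to be supplied by the caller, and a given tool may tokenize or count syllables differently, so its scores can deviate from these.

```python
import math


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """FRE: higher is easier (roughly 0-100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """FKGL: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog: complex words are those with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog(polysyllables: int, sentences: int) -> float:
    """SMOG: based on the count of words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291


def coleman_liau(letters: int, words: int, sentences: int) -> float:
    """CLI: uses letters and sentences per 100 words instead of syllables."""
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def automated_readability(characters: int, words: int, sentences: int) -> float:
    """ARI: characters per word and words per sentence."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def meets_patient_target(grade_level: float, fre: float) -> bool:
    """Threshold stated in the text: grade level <= 8, or FRE >= 60."""
    return grade_level <= 8 or fre >= 60
```

A 100-word passage with 5 sentences and 150 syllables, for instance, gives an FRE just under 60 and an FKGL just under 10, which illustrates how quickly 20-word sentences push text past the patient-material target.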

Statistical analysis

Power analysis

A power analysis was performed using G*Power 3.1. For a repeated-measures analysis of variance (ANOVA) with four groups, assuming an alpha of 0.05, power of 0.80, and a moderate effect size (Cohen’s f = 0.30), a minimum of 12 question-answer pairs was required. This study utilized 15 pairs, providing adequate statistical power for the primary comparisons.

Data analysis

Statistical analysis was performed using SPSS (version 28.0). Inter‑rater reliability for DISCERN scores was assessed using the intraclass correlation coefficient (ICC; two‑way random model). Normality was evaluated using the Shapiro-Wilk test. DISCERN scores and readability indices were compared across models using one-way ANOVA. Post hoc comparisons were performed using the Bonferroni correction to adjust for multiple comparisons [20].

Effect sizes were reported as partial eta-squared (η²) for ANOVA, Cohen’s d for paired t-tests, and r (r = Z/√N) for Wilcoxon tests to facilitate interpretation of clinical significance. Statistical significance was set at p <.05 (two‑tailed); where an overall test was significant, Dunn–Bonferroni post-hoc tests were used for multiple comparisons.
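The effect-size and correction formulas named above are short enough to write out. The sketch below is illustrative only (hypothetical function names; SPSS computes these internally): for a one-way design, eta-squared can be recovered from the F statistic and its degrees of freedom, r is the Wilcoxon Z divided by the square root of the number of observations, and a Bonferroni-adjusted p-value is the raw p-value multiplied by the number of comparisons, capped at 1.

```python
import math


def eta_squared_from_f(f_value: float, df_between: int, df_within: int) -> float:
    """Eta-squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    return (f_value * df_between) / (f_value * df_between + df_within)


def wilcoxon_r(z_value: float, n_observations: int) -> float:
    """Effect size r = Z / sqrt(N) for a Wilcoxon test."""
    return z_value / math.sqrt(n_observations)


def bonferroni_adjust(p_value: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted p-value: raw p times the number of comparisons, capped at 1."""
    return min(1.0, p_value * n_comparisons)
```

With four models there are six pairwise comparisons, so a raw p-value must fall below roughly .0083 to remain under .05 after adjustment.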

Results

In our study, fifteen hypothetical asthma-related questions were posed to each AI tool, and the responses were rated by three independent experts using the DISCERN instrument to evaluate the overall quality of the health information (Table 2, using ChatGPT-4o as an example). Further details on the performance of the other AI tools are presented in Supplementary Table 2S1.

Table 2. DISCERN analysis summary for AI tools' asthma consultations (ChatGPT-4o as an example)

| Question (ChatGPT-4o) | Specialist 1 | Specialist 2 | Specialist 3 | Mean score | DISCERN category |
| --- | --- | --- | --- | --- | --- |
| Question 1: What is the preferred initial treatment if a child has infrequent asthma symptoms (e.g., 1–2 days/week or less)? | 58 | 53 | 62 | 57.67 | Good |
| Question 2: If I want to step down asthma treatment, how long should a child asthmatic symptoms be well controlled? | 56 | 54 | 59 | 56.33 | Good |
| Question 3: Can asthma be triggered by strong emotions or will the symptoms be worsened because of the unstable emotions? | 55 | 54 | 51 | 53.33 | Good |
| Question 4: During the asthma attack, do the air tubes collapse? Is more or less mucus produced in the air tubes? | 52 | 50 | 56 | 52.67 | Good |
| Question 5: By listening to the asthma’s patient chest with a stethoscope, is a doctor manage to measure how bad asthma is? | 53 | 51 | 53 | 52.33 | Good |
| Question 6: What is the treatment option for children whose asthma is not adequately controlled by low-dose maintenance ICS-LABA with as-needed SABA? | 45 | 49 | 50 | 48.00 | Median |
| Question 7: List some add-on biologic therapy for children with uncontrolled severe asthma. | 49 | 52 | 51 | 50.67 | Good |
| Question 8: Will the risk of exacerbation increase when a child switch from maintenance-and-reliever therapy (MART) to conventional ICS-LABA plus as-needed SABA? | 49 | 48 | 52 | 49.67 | Median |
| Question 9: Will a child be addicted to asthma medications if he uses them frequently? | 46 | 44 | 48 | 46.00 | Median |
| Question 10: After an asthmatic exacerbation, how soon should a review visit be scheduled? What does the frequency of visit depend on? | 40 | 41 | 39 | 40.00 | Median |
| Question 11: If a child is exposed to viral infections or seasonal allergen exposure, is a short-term increase in maintenance ICS dose for 1–2 weeks be necessary? | 55 | 52 | 50 | 52.33 | Good |
| Question 12: When an a child with asthma is going to be exposed to something that triggers asthma, can he take medication just before exposure to prevent asthma? | 60 | 58 | 59 | 59.00 | Good |
| Question 13: For children aged 6–11, on what condition can the the treatment be successfully reduced? | 63 | 61 | 57 | 60.33 | Good |
| Question 14: Can high baseline of FeNO been used to predict exacerbation after step-down of ICS dose? | 62 | 66 | 62 | 63.33 | Excellent |
| Question 15: Before asthma treatment step-down, what should be evaluated by patients? | 38 | 36 | 34 | 36.00 | Bad |

| Total DISCERN score (16–80) | n = 15 |
| --- | --- |
| Very bad (16–26) | 0 |
| Bad (27–38) | 1 |
| Median (39–50) | 4 |
| Good (51–62) | 9 |
| Excellent (63–80) | 1 |
| Average score | 51.87 ± 7.65 |

Abbreviations: FeNO: fractional exhaled nitric oxide

Quality of AI-generated responses

A one-way analysis of variance (ANOVA) revealed no statistically significant difference in the quality of health information, measured by DISCERN scores, among the four AI platforms: F(3, 56) = 0.144, P =.933. The effect size was negligible (η² = 0.008). Descriptive statistics indicated a narrow range of mean scores: ChatGPT-4o (51.87 ± 7.65), Gemini 2.0 (51.60 ± 9.04), DeepSeek (50.57 ± 6.40), and Claude3-Opus (50.33 ± 8.53) (Table 3).

Table 3. One-way ANOVA result for DISCERN scores across AI platforms

| | Claude3-Opus (n = 15) | Gemini 2.0 (n = 15) | ChatGPT-4o (n = 15) | DeepSeek (n = 15) | SS | df | MS | F | P | Effect size (η²) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DISCERN scores (mean ± sd) | 50.33 ± 8.53 | 51.60 ± 9.04 | 51.87 ± 7.65 | 50.57 ± 6.40 | 25.12 | 3 | 8.37 | 0.144 | 0.933 | 0.008 |

Abbreviations: SS = Sum of Squares, df = Degrees of Freedom, MS = Mean Square, F = F-value, P = P-value, η² = Eta-squared (effect size)

Inter-rater reliability among the three independent pediatric specialists was high. Intraclass correlation coefficients (ICCs) for the four platforms ranged from 0.849 to 0.901 (all P <.001), confirming high reliability in expert evaluation (Table 4).

Table 4. Intraclass correlation

| AI Platform | Rater | DISCERN Score (mean ± sd) | Median Score | ICC | P, ICC (b) | P, Friedman (a) |
| --- | --- | --- | --- | --- | --- | --- |
| Claude3-Opus | Specialist 1 | 50.2 ± 8.5 | 50 | 0.883 | < 0.001 | 0.627 |
| | Specialist 2 | 49.5 ± 9.2 | 51 | | | |
| | Specialist 3 | 51.3 ± 7.4 | 54 | | | |
| Gemini 2.0 | Specialist 1 | 51.2 ± 8.9 | 51 | 0.9 | < 0.001 | 0.165 |
| | Specialist 2 | 50.7 ± 9.8 | 51 | | | |
| | Specialist 3 | 52.9 ± 8.0 | 54 | | | |
| ChatGPT-4o | Specialist 1 | 52.1 ± 7.6 | 53 | 0.901 | < 0.001 | 0.408 |
| | Specialist 2 | 51.3 ± 7.5 | 52 | | | |
| | Specialist 3 | 52.2 ± 7.8 | 52 | | | |
| DeepSeek | Specialist 1 | 49.6 ± 7.3 | 50 | 0.849 | < 0.001 | 0.189 |
| | Specialist 2 | 50.3 ± 6.3 | 49 | | | |
| | Specialist 3 | 51.8 ± 5.5 | 52 | | | |

ICC = Intraclass correlation coefficient. All ICCs were calculated using a two-way random-effects model for absolute agreement. An ICC > 0.75 indicates good reliability.
(a) P value from the Friedman test. (b) P value for the intraclass correlation.
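The ICC calculation can be illustrated with the per-rater ChatGPT-4o scores from Table 2. The sketch below computes the single-rater, absolute-agreement form of the two-way random-effects ICC, often written ICC(2,1). The choice of the single-rater form is an assumption (the text specifies only "two-way random-effects model for absolute agreement"), and the function name is hypothetical; with these inputs the result comes out close to the 0.901 reported for ChatGPT-4o.

```python
def icc_2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per rated response and one column per rater.
    """
    n = len(ratings)      # number of rated responses
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Specialist 1-3 DISCERN totals for ChatGPT-4o, Questions 1-15 (Table 2).
CHATGPT_4O_RATINGS = [
    [58, 53, 62], [56, 54, 59], [55, 54, 51], [52, 50, 56], [53, 51, 53],
    [45, 49, 50], [49, 52, 51], [49, 48, 52], [46, 44, 48], [40, 41, 39],
    [55, 52, 50], [60, 58, 59], [63, 61, 57], [62, 66, 62], [38, 36, 34],
]

chatgpt_icc = icc_2_1(CHATGPT_4O_RATINGS)
```

Because the formula penalizes systematic offsets between raters as well as random disagreement, it is stricter than a consistency-type ICC computed on the same data.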

Readability of AI-generated responses

The readability of the AI-generated responses, assessed across six standardized indices, demonstrated significant variations between platforms (Table 5). Shapiro-Wilk tests confirmed that all readability indices followed a normal distribution. Statistically significant differences were identified for three key metrics: Flesch Reading Ease (FRE) [F(3,56) = 5.19, P=.003, η²=0.22], Gunning Fog Index [F(3,56) = 3.11, P=.034, η²=0.14], and Coleman-Liau Index (CLI) [F(3,56) = 9.22, P<.001, η²=0.33].
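Because every group has the same size (n = 15), the one-way ANOVA can be approximately re-derived from the group means and standard deviations in Table 5 alone. The sketch below is a plausibility check under that equal-n assumption, not the SPSS analysis itself, and the function name is hypothetical. Applied to the FRE row, it gives an F close to 5.19 and an η² close to 0.22, in line with the reported values; small deviations are expected because the inputs are rounded.

```python
def anova_from_summary(means: list[float], sds: list[float], n_per_group: int):
    """One-way ANOVA from group means and SDs, assuming equal group sizes.

    Returns (F, df_between, df_within, eta_squared).
    """
    k = len(means)
    grand_mean = sum(means) / k  # valid only because all groups share one n
    ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in means)
    ss_within = (n_per_group - 1) * sum(sd ** 2 for sd in sds)
    df_between = k - 1
    df_within = k * (n_per_group - 1)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, df_between, df_within, eta_squared


# FRE mean and SD per model, in the order
# Claude3-Opus, Gemini 2.0, ChatGPT-4o, DeepSeek (Table 5).
fre_result = anova_from_summary(
    means=[31.60, 33.25, 34.25, 21.84],
    sds=[11.18, 8.29, 8.80, 10.24],
    n_per_group=15,
)
```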

Table 5. Readability metrics across AI platforms (Mean ± SD)

| Metric | Claude3-Opus | Gemini 2.0 | ChatGPT-4o | DeepSeek | ANOVA F | ANOVA P-value | Partial η² | Shapiro-Wilk W | Shapiro-Wilk P-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FRE | 31.60 ± 11.18 | 33.25 ± 8.29 | 34.25 ± 8.80 | 21.84 ± 10.24 | 5.191 | 0.003 | 0.218 | 0.969 | 0.129 |
| FKGL | 14.91 ± 2.99 | 13.24 ± 1.45 | 13.91 ± 2.15 | 14.52 ± 2.79 | 1.355 | 0.266 | 0.068 | 0.979 | 0.407 |
| Gunning FOG | 18.89 ± 4.46 | 15.57 ± 1.67 | 16.27 ± 1.97 | 16.88 ± 3.58 | 3.107 | 0.034* | 0.143 | 0.904 | 0.208 |
| SMOG | 13.50 ± 2.12 | 12.52 ± 1.36 | 12.71 ± 1.48 | 12.79 ± 2.77 | 0.674 | 0.571 | 0.035 | 0.987 | 0.758 |
| CLI | 15.25 ± 1.67 | 15.95 ± 1.52 | 15.14 ± 1.74 | 18.03 ± 1.88 | 9.223 | < 0.001 | 0.331 | 0.977 | 0.332 |
| ARI | 15.98 ± 3.61 | 13.83 ± 1.60 | 14.44 ± 2.13 | 14.48 ± 3.53 | 1.543 | 0.213 | 0.076 | 0.971 | 0.165 |

Abbreviations: FRE, Flesch Reading Ease; FKGL, Flesch-Kincaid Grade Level; SMOG, Simple Measure of Gobbledygook; CLI, Coleman-Liau Index; ARI, Automated Readability Index

Table 6. Characteristics of Flesch Kincaid Reading Ease (FRE) generated by different AI tools

| AI tool (I) | Compared with (J) | Mean Difference (I-J) | P-value (a) | Cohen’s d | Effect Size Interpretation | 95% CI Lower Bound | 95% CI Upper Bound |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude3-Opus | Gemini 2.0 | -1.65 | 0.999 | 0.170 | Small | -11.3327 | 8.0327 |
| | ChatGPT-4o | -2.65 | 0.999 | 0.273 | Small | -12.3327 | 7.0327 |
| | DeepSeek | 9.762 | 0.047 | 1.007 | Large | 0.0793 | 19.4447 |
| Gemini 2.0 | Claude3-Opus | 1.65 | 0.999 | 0.170 | Small | -8.0327 | 11.3327 |
| | ChatGPT-4o | -1 | 0.999 | 0.103 | Small | -10.6827 | 8.6827 |
| | DeepSeek | 11.412 | 0.013 | 1.177 | Large | 1.7293 | 21.0947 |
| ChatGPT-4o | Claude3-Opus | 2.65 | 0.999 | 0.273 | Small | -7.0327 | 12.3327 |
| | Gemini 2.0 | 1 | 0.999 | 0.103 | Small | -8.6827 | 10.6827 |
| | DeepSeek | 12.412 | 0.005 | 1.280 | Large | 2.7293 | 22.0947 |
| DeepSeek | Claude3-Opus | -9.762 | 0.047 | 1.007 | Large | -19.4447 | -0.0793 |
| | Gemini 2.0 | -11.412 | 0.013 | 1.177 | Large | -21.0947 | -1.7293 |
| | ChatGPT-4o | -12.412 | 0.005 | 1.280 | Large | -22.0947 | -2.7293 |

(a) P values <.05 are considered statistically significant. Cohen’s d > 0.8 can be interpreted as a large effect.

None of the AI platforms met established readability guidelines for patient materials. The mean Flesch-Kincaid Grade Level ranged from 13.2 to 14.9, and Flesch Reading Ease scores ranged from 21.8 to 34.3, categorizing all outputs as “difficult” and corresponding to a college-level reading requirement.

Further post-hoc pairwise comparisons with Bonferroni correction clarified these patterns. For the primary readability measure FRE (Table 6), ChatGPT-4o generated significantly more readable text than DeepSeek (mean difference = 12.41, P=.005, Cohen’s d = 1.28, large effect). More generally, DeepSeek’s responses were significantly less readable than those from all other platforms (all P <.05, Cohen’s d > 1.0). No significant differences were observed among Claude3-Opus, Gemini 2.0, and ChatGPT-4o after correction (all P >.05). A similar pattern was observed for CLI, where DeepSeek’s text complexity was significantly higher than that of all other models (all P ≤.009; Supplementary Table 2S2).
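The Cohen's d values in Table 6 are consistent with dividing each FRE mean difference by a standard deviation pooled across all four groups (equivalently, the square root of the ANOVA within-group mean square). The text does not state this explicitly, so the sketch below should be read as an inference from the reported numbers, with hypothetical names throughout.

```python
import math
from itertools import combinations

# FRE (mean, SD) per model from Table 5; each group has n = 15.
FRE_SUMMARY = {
    "Claude3-Opus": (31.60, 11.18),
    "Gemini 2.0": (33.25, 8.29),
    "ChatGPT-4o": (34.25, 8.80),
    "DeepSeek": (21.84, 10.24),
}


def pooled_sd(sds: list[float]) -> float:
    """SD pooled across groups; with equal n this is the root mean variance."""
    return math.sqrt(sum(sd ** 2 for sd in sds) / len(sds))


def pairwise_cohens_d(summary: dict[str, tuple[float, float]]):
    """Return {(model_a, model_b): (mean difference, |Cohen's d|)} for every pair."""
    sd = pooled_sd([group_sd for _, group_sd in summary.values()])
    results = {}
    for a, b in combinations(summary, 2):
        diff = summary[a][0] - summary[b][0]
        results[(a, b)] = (diff, abs(diff) / sd)
    return results


fre_pairs = pairwise_cohens_d(FRE_SUMMARY)
```

Under this reading, all three comparisons against DeepSeek exceed d = 1.0, while the three comparisons among the other models stay below 0.3.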

Stratified analysis of question types

To further clarify the performance of LLMs across the targeted user groups, stratified analyses were conducted for caregiver/patient-focused (n = 7) and clinician-focused (n = 8) questions (Supplementary Table 2S3). A clear pattern emerged for three models (Claude 3 Opus, Gemini 2.0, and ChatGPT-4o): their responses to caregiver/patient-focused questions were significantly more readable than their responses to clinician-focused queries. ChatGPT-4o produced significantly more readable text for caregivers (FRE: 39.78 ± 6.82) than for clinicians (FRE: 29.42 ± 7.60; mean difference = 10.36, t(13) = 2.76, P =.016). Similar statistically significant differences in FRE were observed for Claude 3 Opus (mean diff. = 12.33, P =.027) and Gemini 2.0 (mean diff. = 10.57, P =.008). DeepSeek showed no significant adjustment in readability based on the target audience. All DISCERN score comparisons between caregiver- and clinician-focused responses were non-significant (all P >.39).
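The ChatGPT-4o comparison can be checked from its summary statistics. The sketch below assumes an independent-samples Student's t-test with pooled variance (the text does not name the variant) and uses the group sizes given in the Methods, 7 caregiver/patient-focused and 8 clinician-focused questions. With those inputs the statistic comes out at about 2.76 on 13 degrees of freedom, matching the reported t(13) = 2.76. The function name is hypothetical.

```python
import math


def pooled_t(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> tuple[float, int]:
    """Student's t statistic (pooled variance) and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# ChatGPT-4o FRE: caregiver/patient-focused vs clinician-focused questions.
t_stat, dof = pooled_t(39.78, 6.82, 7, 29.42, 7.60, 8)
```

The same two subgroup means, weighted 7 to 8, also average to the overall ChatGPT-4o FRE of 34.25 in Table 5, which is a quick consistency check on the group sizes.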

Discussion

Principal findings and interpretation

This study evaluated LLMs across the continuum of pediatric asthma communication by employing a question set that reflects needs of both clinicians and patients/caregivers. Our findings reveal that while these models achieve factual reliability on standardized queries, they exhibit significant and clinically relevant disparities in linguistic accessibility.

In this assessment, ChatGPT-4o achieved the highest mean score (51.87 ± 7.65), with over 90% of its responses (14 of 15) scoring in the fair band (39–50) or higher, indicating reliable performance in conveying pediatric asthma information. However, statistical analysis showed no significant difference among the four platforms as measured by the DISCERN instrument. This suggests that LLMs have converged in informational reliability, supporting their potential as adjunctive tools for clinician reference [21–22].

Despite generally reliable performance, instances of incomplete or misleading information were noted even among responses categorized as “Good” (DISCERN score 51–62) by the instrument. For example, in Question 10 (post-exacerbation follow-up), ChatGPT-4o failed to specify the guideline-recommended “2–4 week” window, despite receiving a high DISCERN score. This discrepancy highlights a key limitation of the DISCERN instrument: it prioritizes overall information structure, reliability, and scope over granular clinical detail. Thus, a “Good” DISCERN rating does not guarantee the precision of context-specific details critical for pediatric asthma management, reinforcing the necessity of expert oversight in AI-assisted patient communication.

In stark contrast to the comparable information quality, we identified significant disparities in textual readability. All models failed to meet the AMA/NIH-recommended 6th- to 8th-grade reading level for patient materials [23–24], with mean outputs corresponding to college-level complexity [25]. The mean FRE scores (21.8–34.3) further classify the outputs as “difficult.”

The 12.41-point FRE difference between ChatGPT-4o and DeepSeek (P =.005, Cohen’s d = 1.28) is educationally meaningful, representing a shift of more than one full U.S. grade level in comprehension demand [26]. In addition, the significant differences in the Gunning Fog Index may arise from variations in sentence complexity and dense terminology, which can impede reading flow. The significantly higher CLI for DeepSeek indicates greater use of complex, longer words, which increases cognitive load [27]. Collectively, these disparities indicate tangible variations in the potential comprehensibility of AI-generated patient instructions. Our stratified analysis further reveals that three models (Claude 3 Opus, Gemini 2.0, and ChatGPT-4o) demonstrate a degree of contextual adaptation, generating responses to caregiver/patient-focused questions that were significantly more readable than their outputs for clinicians, as evidenced by higher FRE scores. DeepSeek showed no such capability, producing text of uniformly high complexity regardless of the audience. This finding clarifies that the models’ utility is context-dependent: while AI tools may serve as reference tools for clinicians, their suitability for direct patient communication varies considerably based on their capacity for linguistic simplification.

Clinical implications and practical recommendations

The expanding role of AI technology within healthcare holds promise for both clinical decision support and patient care and education. The divergence between accuracy and accessibility necessitates a bifurcated implementation framework. For clinician-assistive applications, the priority remains factual accuracy and integration with authoritative medical databases. LLMs can function as efficient sources for preliminary information that requires expert verification.

For patient-facing or educational applications, readability must be elevated to a core performance metric. Poorly understandable instructions can precipitate adverse events [28]. Our data suggest that in the absence of simplification features, platforms like ChatGPT-4o, with their relatively higher baseline readability, are preferable for generating draft patient materials. Crucially, no current model is suitable for direct, unsupervised patient use.

The recommended reading-level threshold is informed by national surveys indicating that median adult literacy was at the 6th-grade level [29]. This standard is further reinforced in pediatric care by CDC recommendations to optimize asthma education for caregivers with limited health literacy [30]. To bridge the accessibility gap, future development should integrate adaptive text simplification. This could involve simplifying sentence structures to organize content in a more understandable and logical manner, employing clear and plain language, providing comprehensive explanations [31] (e.g., footnotes), integrating visual elements, and collecting educational background information to customize more targeted outputs.

Ultimately, LLMs should be positioned as adjuncts to professional judgment and humanized communication, with their application rigorously matched to the specific clinical context and patients’ needs.

Limitations and methodological considerations

Several methodological aspects must be considered when interpreting our findings.

First, our 15 questions were derived from the GINA 2024 guideline, focusing on clear, evidence-based answers without including clinically controversial scenarios. This may overestimate LLM performance in real-world settings, where questions often lack definitive answers. Second, the limited sample size may be insufficient to detect subtle but clinically meaningful differences in DISCERN scores (e.g., between ChatGPT-4o and Gemini 2.0). Third, our evaluation relied on objective readability indices and expert scoring; it did not assess the actual comprehension of AI outputs by caregivers with varying health literacy levels, a critical area for future validation research. Finally, the rapidly evolving nature of AI models poses challenges for evaluation. This study focused on the model versions available at the time of testing, and our findings may not apply to future iterations, necessitating reassessment.

Conclusion

Large language models (LLMs) show a mixed profile in pediatric asthma care: they are clinically reliable but linguistically inaccessible. As decision-support tools for clinicians, these platforms can facilitate clinical reference. However, for direct patient education and communication, current LLMs are not yet fit for purpose due to a critical and uniform deficit in readability. Therefore, future development should advance on two parallel tracks: enhancing the clinical reasoning and verification capabilities of LLMs for professional support, while fundamentally prioritizing readability-by-design for public-oriented communication. Healthcare providers should engage with LLMs while rigorously safeguarding the clarity and safety of information delivered to patients.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Bibliography

1. Anthropic AI. The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 model card. 2024. Accessed February 26, 2025. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
2. GBD 2019 Diseases and Injuries Collaborators. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019 [published correction appears in Lancet. 2020;396(10262):1562]. Lancet. 2020;396(10258):1204–1222. doi: 10.1016/S0140-6736(20)30925-9. PMCID: PMC7567026. PMID: 33069326.
3. WebFX readability test. WebFX website. Accessed September 10, 2024. https://www.webfx.com/tools/read-able/