# Evaluating and Validating Large Language Models for Health Education on Developmental Dysplasia of the Hip: 2-Phase Study With Expert Ratings and a Pilot Randomized Controlled Trial

**Authors:** Hui Ouyang, Gan Lin, Yiyuan Li, Zhixin Yao, Yating Li, Han Yan, Fang Qin, Jinghui Yao, Yun Chen

PMC · DOI: 10.2196/73326 · Journal of Medical Internet Research · 2026-01-19

## TL;DR

This study evaluates how well large language models can create educational materials for developmental dysplasia of the hip and finds that LLM-assisted education was associated with modest improvements in caregivers' eHealth literacy and DDH knowledge compared with web search.

## Contribution

The study introduces a 2-phase evaluation framework combining expert ratings and a pilot trial to assess LLMs for health education on DDH.

## Key findings

- ChatGPT-4 and DeepSeek-V3 produced more accurate DDH educational content than Copilot, and DeepSeek-V3 produced richer content than Copilot.
- LLM-assisted education improved caregivers' eHealth literacy and DDH knowledge compared to web search.
- LLMs cannot fully replace clinical evaluation but can support general informational needs.

## Abstract

Developmental dysplasia of the hip (DDH) is a common pediatric orthopedic disease, and health education is vital to disease management and rehabilitation. The emergence of large language models (LLMs) has provided new opportunities for health education. However, the effectiveness and applicability of LLMs in DDH health education have not been systematically evaluated.

This study conducted an integrated 2-phase evaluation to assess the quality and educational effectiveness of LLM-generated educational materials.

This study comprised 2 phases. In phase 1, a 16-item DDH question bank based on Bloom’s taxonomy was created through literature analysis and collaboration. Four LLMs (ChatGPT-4 [OpenAI], DeepSeek-V3, Gemini 2.0 Flash [Google], and Copilot [Microsoft Corp]) were questioned using standardized prompts. All responses were independently evaluated by 5 pediatric orthopedic experts using 5-point Likert scales for accuracy, fluency, and richness; the Patient Education Materials Assessment Tool for Printable Materials; and DISCERN. Readability was measured using a readability formula. The data were analyzed using Kruskal-Wallis tests, ANOVA, and post hoc comparisons. In phase 2, an assessor-blinded, 2-arm pilot randomized controlled trial was conducted. A total of 127 caregivers were randomized into an LLM-assisted education group or a web search control group. The intervention included structured LLM training, supervised practice, and 2 weeks of reinforcement training. The outcomes, measured at baseline, postintervention, and 2-week follow-up, were eHealth literacy (primary), DDH knowledge, health risk perception, perceived usefulness, information self-efficacy, and health information-seeking behavior. Analyses followed the intention-to-treat principle and used linear mixed-effects models and Cohen d effect sizes.
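As a rough illustration of the effect size named above, Cohen d for two independent groups can be sketched as the difference in means divided by the pooled standard deviation. The abstract does not state exactly how the authors derived their d values (they may come from the mixed-model estimates), so the function name `cohens_d` and the scores below are hypothetical and used only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical eHealth literacy scores, NOT data from the study.
intervention_scores = [34, 33, 35, 32, 36]
control_scores = [32, 31, 33, 30, 34]

d = cohens_d(intervention_scores, control_scores)
print(f"Cohen d = {d:.2f}")
```

By the usual convention, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, respectively.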

There were significant differences between the 4 LLMs in accuracy, richness, fluency, Patient Education Materials Assessment Tool for Printable Materials Understandability, and DISCERN (P<.05). ChatGPT-4 (median 63.67, IQR 63.67-64.67) and DeepSeek-V3 (median 63.67, IQR 63.33-64.67) generated more accurate text than Copilot (median 59.00, IQR 58.67-59.67). DeepSeek-V3 (median 64.00, IQR 64.00-64.00) produced richer content than Copilot (median 52.33, IQR 51.33-52.67). Gemini 2.0 Flash (median 72.67, IQR 72.33-73.00) was more fluent than Copilot (median 65.67, IQR 63.33-65.67). In phase 2, the intervention group showed higher eHealth literacy at postintervention (T1; 33.62, 95% CI 32.76-34.49; d=0.20, 95% CI 0.13-0.56) and at 2-week follow-up (T2; 33.27, 95% CI 32.38-34.17; d=0.36, 95% CI 0.01-0.80), greater DDH knowledge at T1 (7.87, 95% CI 7.48-8.25; d=0.71, 95% CI 0.33-1.11) and T2 (7.12, 95% CI 6.72-7.51; d=0.54, 95% CI 0.17-0.96), and slight improvements in health risk perception and perceived usefulness.
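The between-model comparisons reported above rely on Kruskal-Wallis tests. A minimal sketch of the H statistic follows, assuming the textbook rank-sum formula without a tie correction; it is not the authors' analysis code. The helper names `average_ranks` and `kruskal_wallis_h` and the ratings are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical expert ratings for three models, NOT data from the study.
ratings = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_statistic = kruskal_wallis_h(ratings)
print(f"H = {h_statistic:.2f}")
```

With k groups, H is compared against a chi-square distribution with k - 1 degrees of freedom to obtain a P value. A statistics package would normally handle the tie correction and the P value.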

Mainstream LLMs demonstrate varying capacities in generating educational content for DDH. LLM-generated DDH caregiver education materials were associated with modest improvements in eHealth literacy and knowledge. Although LLMs can address general informational needs, they cannot completely substitute for clinical evaluation. Future research should focus on optimizing plain language, refining dialogue design, and enhancing audience personalization to improve the quality of LLM-generated materials.

Chinese Clinical Trial Registry ChiCTR2500108410; https://www.chictr.org.cn/showproj.html?proj=271987

## Linked entities

- **Diseases:** Developmental dysplasia of the hip (MONDO:0000158)

## Full-text entities

- **Diseases:** DDH (MESH:D000082602), gait abnormalities (MESH:D020233), anxiety (MESH:D001007), depression (MESH:D003866), HISBS (MESH:C538175), bipolar disorder (MESH:D001714), hallucinations (MESH:D006212), cancer (MESH:D009369), chronic pain (MESH:D059350), osteoarthritis (MESH:D010003), schizophrenia (MESH:D012559), mental illnesses (MESH:D001523), LLMs (MESH:D007806), orthopedic condition (MESH:D009140), hearing or visual impairment (MESH:D006311)
- **Chemicals:** FKGL (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12865344/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12865344/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/PMC12865344/full.md

---
Source: https://tomesphere.com/paper/PMC12865344