Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts
Siyu Zhu, Mouxiao Bian, Yue Xie, Yongyu Tang, Zhikang Yu, Tianbin Li, Pengcheng Chen, Bing Han, Jie Xu, Xiaoyan Dong

TL;DR
This study systematically evaluates whether large language models can serve as competent pediatricians in real-world clinical settings, revealing strengths in basic knowledge but limitations in complex reasoning, ethics, and dynamic decision-making.
Contribution
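Introduces PEDIASBench, a comprehensive framework for evaluating LLMs in pediatrics, and provides an empirical assessment of 12 models across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and ethics.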
Findings
Models perform well on basic knowledge questions.
Performance drops by ~15% with increased task complexity.
Most models struggle with real-time patient adaptation and humanistic care.
Abstract
With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework organized around a knowledge system and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased,…
Taxonomy
Topics: Genomics and Rare Diseases · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills
