Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

Siyu Zhu; Mouxiao Bian; Yue Xie; Yongyu Tang; Zhikang Yu; Tianbin Li; Pengcheng Chen; Bing Han; Jie Xu; Xiaoyan Dong

arXiv:2511.13381·cs.CL·November 18, 2025

Open Access

TL;DR

This study systematically evaluates whether large language models can serve as competent pediatricians in real-world clinical settings, revealing strengths in basic knowledge but limitations in complex reasoning, ethics, and dynamic decision-making.

Contribution

Introduces PEDIASBench, a comprehensive evaluation framework for pediatric LLMs, and provides an empirical assessment of 12 models across multiple clinical dimensions.

Findings

01. Models perform well on basic knowledge questions.

02. Performance drops by ~15% with increased task complexity.

03. Most models struggle with real-time patient adaptation and humanistic care.

Abstract

With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework built around a knowledge system and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased,…

Taxonomy

Topics: Genomics and Rare Diseases · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills