Evaluation of large language models in rheumatology and clinical immunology: a systematic assessment based on Chinese national health professional qualification examination

Yaqing Wang; Yue Jiang; Wen Jin; Yijun Xu; Weinan Lin; Jiangda Wang; Qin Song; Zhaoxi Fang

PMC · DOI: 10.3389/fmed.2025.1716122 · January 15, 2026

Open Access

TL;DR

This study evaluates how well 11 large language models perform on the Chinese National Health Professional Qualification Examination in Rheumatology and Clinical Immunology.

Contribution

The paper provides a systematic evaluation of LLMs in a specific medical subfield using a national qualification exam.

Findings

01

DeepSeek-R1 and Qwen3 achieved over 90% accuracy in the exam.

02

LLMs showed significant variation in performance across different evaluation dimensions.

03

Accuracy on professional practice ability tasks remained relatively low, indicating limitations in clinical applications.

Abstract

In recent years, large language models (LLMs) have achieved remarkable progress in natural language processing and demonstrated potential applications in medicine. However, their professional capabilities in specific medical subfields, such as immunology, still require systematic evaluation. This study systematically evaluated 11 representative LLMs, including models from the DeepSeek, GPT, Llama, Gemma, and Qwen series, based on the Chinese National Health Professional Qualification Examination in Rheumatology and Clinical Immunology. The evaluation covered four dimensions: basic medical knowledge, related medical knowledge, immunology knowledge, and professional practice ability. The results showed significant differences among the LLMs. DeepSeek-R1 and Qwen3 achieved the best performance, with accuracy exceeding 90%. However, performance on professional practice ability tasks remained relatively low,…
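As a rough illustration of how accuracy could be tallied across the four evaluation dimensions named in the abstract, the following Python sketch groups graded multiple-choice answers by dimension. The record layout, the sample answers, and the function names are hypothetical and are not taken from the paper; the authors' actual scoring procedure may differ.

```python
from collections import defaultdict

# Hypothetical graded records: (dimension, model_answer, correct_answer).
# The answers below are made-up placeholders, not data from the study.
records = [
    ("basic medical knowledge", "A", "A"),
    ("basic medical knowledge", "C", "B"),
    ("related medical knowledge", "D", "D"),
    ("immunology knowledge", "B", "B"),
    ("professional practice ability", "A", "E"),
    ("professional practice ability", "C", "C"),
]


def accuracy_by_dimension(records):
    """Return the fraction of correct answers for each dimension."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for dimension, predicted, expected in records:
        totals[dimension] += 1
        if predicted == expected:
            correct[dimension] += 1
    return {d: correct[d] / totals[d] for d in totals}


def overall_accuracy(records):
    """Return the fraction of correct answers over all questions."""
    return sum(p == e for _, p, e in records) / len(records)


if __name__ == "__main__":
    for dimension, acc in accuracy_by_dimension(records).items():
        print(f"{dimension}: {acc:.1%}")
    print(f"overall: {overall_accuracy(records):.1%}")
```

Reporting accuracy per dimension alongside the overall figure is what would let a gap such as the weaker professional practice results show up even when the overall score is high.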

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Rheumatoid Arthritis Research and Therapies