# Evaluation of large language models in rheumatology and clinical immunology: a systematic assessment based on Chinese national health professional qualification examination

**Authors:** Yaqing Wang, Yue Jiang, Wen Jin, Yijun Xu, Weinan Lin, Jiangda Wang, Qin Song, Zhaoxi Fang

PMC · DOI: 10.3389/fmed.2025.1716122 · 2026-01-15

## TL;DR

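This study evaluates 11 large language models on the Chinese National Health Professional Qualification Examination in Rheumatology and Clinical Immunology. DeepSeek-R1 and Qwen3 exceed 90% accuracy, but all models score lower on professional practice ability tasks.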

## Contribution

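The paper systematically evaluates 11 representative LLMs (including the DeepSeek, GPT, Llama, Gemma, and Qwen series) in rheumatology and clinical immunology. It uses the Chinese National Health Professional Qualification Examination as the benchmark and scores four dimensions: basic medical knowledge, related medical knowledge, immunology knowledge, and professional practice ability.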

## Key findings

- DeepSeek-R1 and Qwen3 performed best, with accuracy exceeding 90% on the exam.
- Performance differed significantly among the 11 LLMs and across the four evaluation dimensions.
- Accuracy on professional practice ability tasks remained relatively low, highlighting limitations in complex clinical applications.

## Abstract

In recent years, large language models (LLMs) have achieved remarkable progress in natural language processing and demonstrated potential applications in medicine. However, their professional capabilities in specific medical subfields, such as immunology, still require systematic evaluation. This study systematically evaluated 11 representative LLMs, including the DeepSeek, GPT, Llama, Gemma, and Qwen series, based on the Chinese National Health Professional Qualification Examination in Rheumatology and Clinical Immunology. The evaluation covered four dimensions: basic medical knowledge, related medical knowledge, immunology knowledge, and professional practice ability. The results showed significant differences among LLMs. DeepSeek-R1 and Qwen3 achieved the best performance, with accuracy exceeding 90%. However, performance on professional practice ability tasks remained relatively low, highlighting limitations in complex clinical applications.
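The per-dimension accuracy described in the abstract can be illustrated with a minimal Python sketch. Only the four dimension names come from the paper. The record layout, the function names `accuracy_by_dimension` and `overall_accuracy`, and the toy answers are assumptions made for illustration, and they are not the authors' evaluation code or results.

```python
from collections import defaultdict

# Dimension names are taken from the abstract; everything else is hypothetical.
DIMENSIONS = (
    "basic medical knowledge",
    "related medical knowledge",
    "immunology knowledge",
    "professional practice ability",
)


def accuracy_by_dimension(records):
    """Return {dimension: fraction correct}.

    `records` is an iterable of (dimension, predicted, answer) tuples,
    an assumed layout for multiple-choice exam items.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, predicted, answer in records:
        totals[dimension] += 1
        if predicted == answer:
            hits[dimension] += 1
    return {dim: hits[dim] / totals[dim] for dim in totals}


def overall_accuracy(records):
    """Return the fraction of all items answered correctly."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    correct = sum(predicted == answer for _, predicted, answer in records)
    return correct / len(records)


# Made-up toy answers, NOT data from the paper.
toy_records = [
    ("basic medical knowledge", "A", "A"),
    ("basic medical knowledge", "B", "B"),
    ("related medical knowledge", "C", "C"),
    ("related medical knowledge", "D", "A"),
    ("immunology knowledge", "B", "B"),
    ("professional practice ability", "A", "C"),
    ("professional practice ability", "E", "E"),
    ("professional practice ability", "B", "D"),
]

if __name__ == "__main__":
    per_dimension = accuracy_by_dimension(toy_records)
    for dim in DIMENSIONS:
        print(f"{dim}: {per_dimension[dim]:.2f}")
    print(f"overall: {overall_accuracy(toy_records):.3f}")
```

Reporting accuracy per dimension alongside the overall score is what makes a gap such as the one on professional practice ability visible, since a high overall score can mask a weak dimension.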

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12852015/full.md

---
Source: https://tomesphere.com/paper/PMC12852015