# Benchmarking large language models against human experts in rehabilitation medicine: a multidimensional evaluation

**Authors:** Wenhui Cao, Mengjian Qu, Tao Zhu, Jing Liu, Ying Shen, Jihua Zou, Yi Li, Haiming Wang, Lisha Zhang, Huifang Liu, Qi Wu, Guijuan Zhou, Guanghua Sun, Helin Gong, Yaping Wan, Xiaofeng He, Jun Zhou

PMC · DOI: 10.1186/s12984-026-01903-0 · Journal of NeuroEngineering and Rehabilitation · 2026-03-02

## TL;DR

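This study compares five state-of-the-art LLMs with senior physiatrists at writing rehabilitation plans for 48 real clinical cases. Three models (Grok-4, Gemini-2.5-pro, and Deepseek-r1) scored significantly higher than the expert benchmark, while experts remained stronger in strategic pathway design and humanistic care.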

## Contribution

The study introduces a multidimensional evaluation framework to benchmark LLMs against human experts in real-world rehabilitation medicine tasks.

## Key findings

- Grok-4 and Gemini-2.5-pro significantly outperformed human experts in generating rehabilitation plans.
- Open-source Deepseek-r1 also showed a statistically significant advantage over experts.
- Human experts excelled in strategic pathway design and humanistic care aspects.

## Abstract

Rehabilitation medicine faces a significant challenge due to the rising demand for services coupled with a shortage of specialized professionals. Large Language Models (LLMs) show promise for enhancing clinical efficiency, but their evaluation has been largely limited to simulated scenarios, lacking direct performance comparisons with human experts in complex, real-world clinical tasks.

This study aimed to systematically benchmark five state-of-the-art LLMs against senior physiatrists in formulating comprehensive rehabilitation plans for authentic clinical cases, and to evaluate their utility as clinical decision support tools.

We conducted a rigorous, blinded evaluation using 48 authentic cases across six subspecialties. Plans generated by five LLMs (Grok-4, Gemini-2.5-pro, ChatGPT-5-2025-08-07, Deepseek-r1-0528, and Claude-opus-4-20250514) were compared with expert-authored plans. A panel of six senior physiatrists evaluated the plans using a multi-dimensional framework covering four key domains: Clinical Applicability and Safety (primary safety endpoint), Scientific Rigor, Individualization, and Clarity. To address the data’s hierarchical structure, we employed Linear Mixed-Effects Models (LMM) with random intercepts for cases and raters, and fixed effects for models and language. Pairwise comparisons were adjusted using the Holm-Bonferroni correction.
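One way to write down a model of the kind described here (random intercepts for cases and raters, fixed effects for model and language) is sketched below. The symbols and indexing are assumptions chosen for illustration; this summary does not give the paper's exact specification, for example whether each domain was modelled separately.

```latex
% Illustrative notation only (not taken from the paper):
%   y_{mc\ell r} : score given by rater r to the plan from source m
%                  (an LLM or the expert benchmark) for case c in language \ell
\[
  y_{mc\ell r} = \beta_0 + \alpha_m + \lambda_\ell + u_c + v_r + \varepsilon_{mc\ell r},
\]
\[
  u_c \sim \mathcal{N}(0, \sigma_u^2), \qquad
  v_r \sim \mathcal{N}(0, \sigma_v^2), \qquad
  \varepsilon_{mc\ell r} \sim \mathcal{N}(0, \sigma^2).
\]
```

Here $\alpha_m$ and $\lambda_\ell$ are the fixed effects, while $u_c$ and $v_r$ absorb the fact that some cases are harder and some raters stricter, so that repeated scores on the same case or from the same rater are not treated as independent.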

Quantitative analysis revealed that Grok-4 (mean 4.31) and Gemini-2.5-pro (mean 4.14) significantly outperformed the human benchmark, which was derived from standardized expert solutions (mean 3.56; $P<0.001$). Notably, the open-source Deepseek-r1 (mean 3.69) also achieved a statistically significant advantage over experts ($P<0.001$). Conversely, human experts scored numerically higher than Claude-opus-4 (mean 3.50), though this difference was not statistically significant ($P=0.099$). Qualitative analysis further highlighted human experts’ distinct strengths in strategic pathway design and humanistic care.
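The methods state that pairwise comparisons were adjusted with the Holm-Bonferroni correction. As a minimal sketch of how that step-down adjustment works, the function below takes a list of raw p-values and returns adjusted ones. The function name `holm_bonferroni` and the input values are hypothetical and are not taken from the paper.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values, in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for five model-vs-expert comparisons
# (illustrative numbers only, not results from the study).
raw = [0.0001, 0.0004, 0.03, 0.02, 0.2]
adjusted = holm_bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
```

With these made-up inputs, the two smallest p-values remain below 0.05 after adjustment, while the raw values of 0.02 and 0.03 no longer do. This illustrates why an adjusted threshold is stricter than testing each comparison on its own.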

Top-tier LLMs demonstrate capability in generating high-quality, evidence-based plans, positioning them as effective “executors” for drafting preliminary regimens. We propose a human-AI collaboration paradigm where experts function as “strategists,” focusing on optimization and humanistic care to elevate rehabilitation service quality.

## Full-text entities

- **Genes:** F11R (F11 receptor) [NCBI Gene 50848] {aka CD321, JAM, JAM1, JAMA, JCAM, KAT}
- **Diseases:** bleeding (MESH:D006470), AI (MESH:C538142), LLMs (MESH:D007806), stroke (MESH:D020521), LMM (MESH:D004195), shoulder instability (MESH:D000070599), aphonia (MESH:D001044), cancer (MESH:D009369), Psychiatric (MESH:D001523), anxiety (MESH:D001007), hemiplegia (MESH:D006429), pelvic floor dysfunction (MESH:D059952), TCM syndrome (MESH:C562377), functional impairments (MESH:D003072), hallucinations (MESH:D006212), dysphagia (MESH:D003680)
- **Chemicals:** C&S (MESH:D002586), Chinese herbal (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12951902/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12951902/full.md

## References

6 references — full list in the complete paper: https://tomesphere.com/paper/PMC12951902/full.md

---
Source: https://tomesphere.com/paper/PMC12951902