# Evaluating the performance of large language models in sarcopenia-related patient queries: a foundational assessment for patient-centered validation

**Authors:** Tao Huang, Ben Kirk, Jacqueline Close, Jae-young Lim, Gustavo Duque, Peter Ebeling, Minghui Yang, Maoyi Tian, Chun Sing Chui, Chaoran Liu, Ning Zhang, Wing-Hoi Cheung, Ronald Man Yeung Wong

PMC · DOI: 10.3389/fragi.2026.1712785 · 2026-02-27

## TL;DR

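This study evaluates how well three large language models (ChatGPT, Gemini, and Deepseek) answer patient-centered questions about sarcopenia. All three performed well, and the small differences between them across clinical domains did not reach statistical significance.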

## Contribution

The study provides an initial, expert-based assessment of LLM performance in sarcopenia-related clinical queries.

## Key findings

- All three LLMs achieved good performance with no 'Poor' responses across any domain.
- Deepseek provided the longest and most detailed responses, while ChatGPT had the highest proportion of 'Good' ratings.
- Gemini excelled in 'pathogenesis' and 'diagnosis' but received the most critical feedback in 'prevention and treatment.'

## Abstract

Large Language Models (LLMs) have shown promise in clinical applications but their performance in specialized areas such as sarcopenia remains understudied.

A panel of sarcopenia clinician researchers developed 20 standardized patient-centered questions across six clinical domains. Each question was input into three LLMs (ChatGPT, Gemini, and Deepseek), and the responses were anonymized, randomized, and independently assessed by three clinician researchers. Accuracy was graded on a four-point scale (“Poor” to “Excellent”), and comprehensiveness was evaluated on a five-point scale for responses rated “Good” or higher.
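The anonymization and randomization step can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `blind_and_shuffle`, the `R001`-style codes, the fixed seed, and the toy placeholder answers are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlindedResponse:
    code: str         # opaque label shown to raters (hypothetical format)
    question_id: int  # which question the response answers
    text: str         # response text, with no model name attached


def blind_and_shuffle(responses, seed=0):
    """Blind and shuffle model responses before expert rating.

    `responses` maps (model_name, question_id) -> response text.
    Returns the blinded responses in random order plus a separate
    key (code -> model_name) that is withheld from the raters.
    """
    rng = random.Random(seed)          # seeded so the order is reproducible
    items = sorted(responses.items())  # fixed starting order before shuffling
    rng.shuffle(items)
    blinded, key = [], {}
    for i, ((model, qid), text) in enumerate(items, start=1):
        code = f"R{i:03d}"
        blinded.append(BlindedResponse(code, qid, text))
        key[code] = model
    return blinded, key


# Toy input: placeholder text only, not real model output.
models = ["ChatGPT", "Gemini", "Deepseek"]
toy = {(m, q): f"placeholder answer to question {q}"
       for m in models for q in (1, 2)}
blinded, key = blind_and_shuffle(toy)
```

Keeping the code-to-model key apart from the blinded list means raters see only an opaque code, a question number, and the response text; the key is consulted only after all ratings are collected.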

All LLMs achieved good performance, with no responses rated “Poor” across any domain. Deepseek had the longest and most detailed responses (mean word count: 583.75 ± 71.89) and showed superior performance in “risk factors” and “prognosis.” ChatGPT provided the most concise replies (359.5 ± 87.89 words, p = 0.0011) but achieved the highest proportion of “Good” ratings (90%). Gemini excelled in “pathogenesis” and “diagnosis” but received the most critical feedback in “prevention and treatment.” Although trends in performance differences were noted, they did not reach statistical significance. Mean comprehensiveness scores were also similar across models (Deepseek: 4.017 ± 0.77, Gemini: 3.97 ± 0.88, ChatGPT: 3.953 ± 0.83; p > 0.05).
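The summary statistics reported above (mean word count ± standard deviation and the proportion of each accuracy rating) could be computed along the following lines. This is a hedged sketch on toy data: the helper names are hypothetical, whitespace splitting is an assumed word-count rule, and the sample standard deviation is an assumption, since the abstract does not state which estimator or significance test the authors used.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_word_counts(texts):
    """Return (mean, sample SD) of whitespace-delimited word counts."""
    counts = [len(t.split()) for t in texts]
    return mean(counts), stdev(counts)


def rating_proportions(ratings):
    """Return the share of each accuracy label among the given ratings."""
    tally = Counter(ratings)
    total = len(ratings)
    return {label: n / total for label, n in tally.items()}


# Toy data for illustration only; these are not the study's responses or ratings.
toy_texts = ["one two three", "one two three four five"]
toy_ratings = ["Good", "Good", "Excellent", "Good"]

avg_words, sd_words = summarize_word_counts(toy_texts)
shares = rating_proportions(toy_ratings)
```

Only labels present in the input appear in the result, so a label that was never assigned (such as “Poor” in this study) is simply absent rather than listed with a share of zero.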

Despite minor differences in performance across domains, all three LLMs demonstrated acceptable accuracy and comprehensiveness when responding to sarcopenia-related queries. Their comparable results may reflect similarly recent training data and language capabilities. These findings suggest that LLMs could serve as a valuable tool for patient education and care in sarcopenia. This study provides an initial, expert-based assessment of LLM information quality regarding sarcopenia. Although the responses demonstrated good accuracy, the evaluation addresses content correctness from a clinical perspective. Future research must complement these findings by directly engaging older adult cohorts before clinical implementation can be considered. Human oversight also remains essential to ensure safe and appropriate assessment and individually tailored advice and management.

## Full-text entities

- **Diseases:** sarcopenia (MESH:D055948)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12983229/full.md

---
Source: https://tomesphere.com/paper/PMC12983229