Evaluating the performance of large language models in sarcopenia-related patient queries: a foundational assessment for patient-centered validation

Tao Huang; Ben Kirk; Jacqueline Close; Jae-young Lim; Gustavo Duque; Peter Ebeling; Minghui Yang; Maoyi Tian; Chun Sing Chui; Chaoran Liu; Ning Zhang; Wing-Hoi Cheung; Ronald Man Yeung Wong

PMC · DOI: 10.3389/fragi.2026.1712785 · February 27, 2026

Open Access

TL;DR

This study evaluates how well three large language models (ChatGPT, Deepseek, and Gemini) answer patient questions about sarcopenia. All three performed well, with modest differences between clinical domains.

Contribution

The study provides the first expert-based assessment of LLM performance in sarcopenia-related clinical queries.

Findings

01

All three LLMs achieved good performance with no 'Poor' responses across any domain.

02

Deepseek provided the longest and most detailed responses, while ChatGPT had the highest proportion of 'Good' ratings.

03

Gemini excelled in 'pathogenesis' and 'diagnosis' but received the most critical feedback in 'prevention and treatment.'

Abstract

Large Language Models (LLMs) have shown promise in clinical applications, but their performance in specialized areas such as sarcopenia remains understudied. A panel of sarcopenia clinician researchers developed 20 standardized patient-centered questions across six clinical domains. Each question was input into three LLMs (ChatGPT, Deepseek, and Gemini), and responses were anonymized, randomized, and independently assessed by three clinician researchers. Accuracy was graded on a four-point scale (“Poor” to “Excellent”), and comprehensiveness was evaluated for responses rated “Good” or higher using a five-point scale. All LLMs achieved good performance, with no responses rated “Poor” across any domain. Deepseek had the longest and most detailed responses (mean word count: 583.75 ± 71.89) and showed superior performance in “risk factors” and “prognosis.” ChatGPT provided the most concise replies (359.5 ± 87.89…
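
The summary statistics quoted in the abstract (mean word count ± spread, share of responses at each accuracy grade, and the "Good or higher" gate for comprehensiveness scoring) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The function names are hypothetical, the label "Fair" is a placeholder for the scale level the abstract does not name, and the use of the sample standard deviation is an assumption, since the abstract does not say which spread measure was reported.

```python
from collections import Counter
from statistics import mean, stdev

# Assumed labels: the abstract names only "Poor", "Good", and "Excellent";
# "Fair" is a placeholder for the unnamed second level of the four-point scale.
ACCURACY_SCALE = ["Poor", "Fair", "Good", "Excellent"]


def word_count_summary(responses):
    """Return (mean, sample standard deviation) of whitespace-delimited word counts."""
    counts = [len(text.split()) for text in responses]
    return mean(counts), stdev(counts)


def rating_proportions(ratings):
    """Return the share of ratings that fall at each level of the accuracy scale."""
    tally = Counter(ratings)
    total = len(ratings)
    return {level: tally[level] / total for level in ACCURACY_SCALE}


def eligible_for_comprehensiveness(rating):
    """Only responses rated "Good" or higher proceed to the comprehensiveness scale."""
    return ACCURACY_SCALE.index(rating) >= ACCURACY_SCALE.index("Good")
```

Applied to one model's 20 responses, `word_count_summary` would give a figure in the same form as the abstract's "mean ± spread" values, and `rating_proportions` would show whether any rating fell in the "Poor" category.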

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species (1)

Homo sapiens (human · species)

Diseases (1)

sarcopenia

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Nutrition and Health in Aging