Evaluating the performance of large language models in sarcopenia-related patient queries: a foundational assessment for patient-centered validation
Tao Huang, Ben Kirk, Jacqueline Close, Jae-young Lim, Gustavo Duque, Peter Ebeling, Minghui Yang, Maoyi Tian, Chun Sing Chui, Chaoran Liu, Ning Zhang, Wing-Hoi Cheung, Ronald Man Yeung Wong

TL;DR
This study evaluates how well three large language models (ChatGPT, Deepseek, and Gemini) answer patient questions about sarcopenia, finding that all perform well, with modest differences across clinical domains.
Contribution
The study provides the first expert-based assessment of LLM performance in sarcopenia-related clinical queries.
Findings
All three LLMs achieved good performance with no 'Poor' responses across any domain.
Deepseek provided the longest and most detailed responses, while ChatGPT had the highest proportion of 'Good' ratings.
Gemini excelled in 'pathogenesis' and 'diagnosis' but received the most critical feedback in 'prevention and treatment.'
Abstract
Large Language Models (LLMs) have shown promise in clinical applications, but their performance in specialized areas such as sarcopenia remains understudied. A panel of sarcopenia clinician researchers developed 20 standardized patient-centered questions across six clinical domains. Each question was input into all three LLMs, and responses were anonymized, randomized, and independently assessed by three clinician researchers. Accuracy was graded on a four-point scale (“Poor” to “Excellent”), and comprehensiveness was evaluated for responses rated “Good” or higher using a five-point scale. All LLMs achieved good performance, with no responses rated “Poor” across any domain. Deepseek had the longest and most detailed responses (mean word count: 583.75 ± 71.89) and showed superior performance in “risk factors” and “prognosis.” ChatGPT provided the most concise replies (359.5 ± 87.89…
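The summary statistics reported in the abstract (mean word count ± standard deviation, and the share of responses at each accuracy rating) could be computed along the following lines. This is a minimal sketch, not the authors' analysis code: the function names are hypothetical, the "Fair" label is an assumed name for the one scale point the abstract does not name, and the use of the sample (n − 1) standard deviation is an assumption, since the abstract does not say which form was used.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical labels: the source names only "Poor", "Good", and "Excellent";
# "Fair" is an assumed label for the remaining point on the four-point scale.
ACCURACY_SCALE = ("Poor", "Fair", "Good", "Excellent")


def word_count_summary(responses):
    """Return (mean, sample SD) of whitespace-delimited word counts."""
    counts = [len(text.split()) for text in responses]
    return mean(counts), stdev(counts)


def rating_proportions(ratings):
    """Return the share of ratings at each scale level, including zeros."""
    tally = Counter(ratings)
    total = len(ratings)
    return {level: tally[level] / total for level in ACCURACY_SCALE}


if __name__ == "__main__":
    # Placeholder inputs for illustration only; these are not study data.
    demo_responses = ["one two three", "one two three four five"]
    demo_ratings = ["Good", "Good", "Excellent", "Fair"]
    avg, sd = word_count_summary(demo_responses)
    print(f"{avg:.2f} ± {sd:.2f}")
    print(rating_proportions(demo_ratings))
```

Reporting every scale level, including those with a zero share, makes a statement such as "no responses rated Poor" directly checkable from the output.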
Figure 1
Figure 2
Figure 3
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Nutrition and Health in Aging
