# Assessing the Quality of AI Responses to Patient Concerns About Axial Spondyloarthritis: Delphi-Based Evaluation

**Authors:** Jiaxin Bai, Xiaojian Ji, Jiali Yu, Yiwen Wang, Yufei Guo, Chao Xue, Wenrui Zhang, Jian Zhu

PMC · DOI: 10.2196/79153 · 2026-01-07

## TL;DR

This study evaluates the quality of health advice that 5 large language models (LLMs) give patients with axial spondyloarthritis (axSpA). Most models agreed closely with clinical guidelines, but hallucinations remain a barrier to safe use.

## Contribution

A patient-centered needs questionnaire was developed through a 2-round Delphi process and used to select the concerns posed to the LLMs. The questionnaire also revealed age-related and clinician-patient differences in health priorities.

## Key findings

- Patients younger than 40 years prioritized symptom management and medication side effects more than older patients.
- LLM accuracy was higher in the diagnosis and examination category (mean score 20.4, SD 0.9) than in the treatment and medication domains (mean score 19.3, SD 1.7).
- GPT-4.0 and Kimi k1.5 had the best overall readability, but hallucinations remain a critical barrier to safe AI use.

## Abstract

Axial spondyloarthritis (axSpA) is a chronic autoinflammatory disease with heterogeneous clinical features, presenting considerable complexity for sustained patient self-management. Although the use of large language models (LLMs) in health care is rapidly expanding, there has been no rigorous assessment of their capacity to provide axSpA-specific health guidance.

This study aimed to develop a patient-centered needs assessment tool and conduct a systematic evaluation of the quality of LLM-generated health advice for patients with axSpA.

A 2-round Delphi consensus process guided the design of the questionnaire, which was subsequently administered to 84 patients with axSpA and 26 rheumatologists. Patient-identified key concerns were formulated as questions and input into 5 LLM platforms (GPT-4.0, DeepSeek-R1, Hunyuan T1, Kimi k1.5, and Wenxin X1), with all prompts and model outputs in Chinese. Responses were evaluated in 2 ways: accuracy was assessed by guideline concordance, scored independently by 2 blinded raters (interrater reliability analyzed via Cohen κ), and readability was assessed with the AlphaReadabilityChinese analytic tool.
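As an illustration of the interrater reliability step, Cohen κ compares the observed agreement between 2 raters with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch only. The rating labels and the sample ratings are hypothetical, since the summary does not state the scoring categories the raters used, and the function name `cohen_kappa` is an assumption for this example rather than the authors' code.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so report perfect agreement by convention (an assumption of this sketch).
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of 6 LLM responses (labels are illustrative only).
rater_a = ["concordant", "concordant", "partial", "discordant", "concordant", "partial"]
rater_b = ["concordant", "partial", "partial", "discordant", "concordant", "concordant"]

kappa = cohen_kappa(rater_a, rater_b)
print(f"{kappa:.3f}")  # prints 0.455
```

In this toy example the raters agree on 4 of 6 items (p_o ≈ 0.667), chance agreement is 14/36 (p_e ≈ 0.389), and κ = 5/11 ≈ 0.455.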

Analysis of the validated questionnaire revealed age-related differences. Patients younger than 40 years prioritized symptom management and medication side effects more than those older than 40 years. Distinct priorities between clinicians and patients were identified for diagnostic mimics and drug mechanisms. LLM accuracy was highest in the diagnosis and examination category (mean score 20.4, SD 0.9) but lower in treatment and medication domains (mean score 19.3, SD 1.7). GPT-4.0 and Kimi k1.5 demonstrated superior overall readability; safety remained generally high (disclaimer rates: GPT-4.0 and DeepSeek-R1 100%; Kimi k1.5 88%).

Needs assessment across age groups and observed divergences between clinicians and patients underline the necessity for customized patient education. LLMs performed robustly on most evaluation metrics, and GPT-4.0 achieved 94% overall agreement with clinical guidelines. These tools hold promise as scalable adjuncts for ongoing axSpA support, provided complex clinical decision-making remains under human oversight. Nevertheless, the prevalence of artificial intelligence hallucinations remains a critical barrier. Only through comprehensive mitigation of such risks can LLM-based medical support be safely accelerated.

## Full-text entities

- **Diseases:** hallucinations (MESH:D006212), Axial Spondyloarthritis (MESH:D000089183), autoinflammatory disease (MESH:D056660)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12824573/full.md

---
Source: https://tomesphere.com/paper/PMC12824573