LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Shihao Xu; Tiancheng Zhou; Jiatong Ma; Yanli Ding; Yiming Yan; Ming Xiao; Guoyi Li; Haiyang Geng; Yunyun Han; Jianhua Chen; Yafeng Deng

arXiv:2602.09379 · cs.MA · February 12, 2026

TL;DR

This paper introduces LingxiDiagBench, a multi-agent benchmark for evaluating large language models on Chinese psychiatric diagnosis. It covers both static diagnostic inference and dynamic multi-turn consultation, and is built on a new dataset of 16,000 synthetic clinical dialogues.

Contribution

The paper presents LingxiDiagBench and LingxiDiag-16K, the first large-scale benchmark and dataset for Chinese psychiatric diagnosis involving multi-turn consultations and realistic patient simulations.

Findings

01. LLMs perform well on binary depression-anxiety classification (up to 92.3%).

02. Performance drops significantly on complex differential diagnoses (28.5%).

03. Dynamic consultation strategies often underperform static evaluations.

Abstract

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across…

Taxonomy

Topics: Machine Learning in Healthcare · Mental Health via Writing · Digital Mental Health Interventions