LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning

Rui Hua; Yu Wei; Zixin Shu; Kai Chang; Dengying Yan; Jianan Xia; Zeyu Liu; Hui Zhu; Shujie Song; Mingzhong Xiao; Xiaodong Li; Dongmei Jia; Zhuye Gao; Yanyan Meng; Naixuan Zhao; Yu Fu; Haibin Yu; Benman Yu; Yuanyuan Chen; Fei Dong; Zhizhou Meng; Pengcheng Yang; Songxue Zhao; Lijuan Pei; Yunhui Hu; Kan Ding; Jiayuan Duan; Wenmao Yin; Yang Gu; Runshun Zhang; Qiang Zhu; Jian Yu; Jiansheng Li; Baoyan Liu; Wenjia Wang; Xuezhong Zhou

arXiv:2602.01779 · cs.AI · February 3, 2026

TL;DR

The paper introduces LingLanMiDian, a comprehensive benchmark for evaluating large language models on Traditional Chinese Medicine tasks, highlighting current models' limitations in domain-specific reasoning and knowledge understanding.

Contribution

It presents a unified, expert-curated evaluation suite for TCM LLMs, including new metrics, protocols, and a hard subset for rigorous assessment.

Findings

01

Current LLMs lag behind human experts in TCM reasoning.

02

The LingLan benchmark reveals significant gaps in knowledge recall and reasoning.

03

Evaluation data and code are publicly available for further research.

Abstract

Large language models (LLMs) are advancing rapidly in medical NLP, yet Traditional Chinese Medicine (TCM), with its distinctive ontology, terminology, and reasoning patterns, requires domain-faithful evaluation. Existing TCM benchmarks are fragmented in coverage and scale and rely on non-unified or generation-heavy scoring that hinders fair comparison. We present the LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation into single-choice decision recognition. We conduct comprehensive, zero-shot evaluations on 14 leading open-source and…
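
The abstract names a synonym-tolerant protocol for clinical labels without spelling out its mechanics here. As a rough illustration of what such scoring could look like, the sketch below maps a predicted label and a gold label to a shared canonical form before comparing them. The synonym table, the function names, and the normalization rules are all hypothetical and are not taken from the LingLan release; the actual protocol may differ.

```python
from typing import Dict, List, Set

# Hypothetical synonym table (NOT from the LingLan benchmark):
# canonical label -> surface forms accepted as equivalent.
SYNONYMS: Dict[str, Set[str]] = {
    "spleen qi deficiency": {"spleen-qi deficiency", "deficiency of spleen qi"},
    "liver qi stagnation": {"liver-qi stagnation", "stagnation of liver qi"},
}


def normalize(label: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(label.lower().replace("-", " ").split())


def canonicalize(label: str, synonyms: Dict[str, Set[str]]) -> str:
    """Return the canonical label if one is known, else the normalized label."""
    norm = normalize(label)
    for canonical, variants in synonyms.items():
        accepted = {normalize(canonical)} | {normalize(v) for v in variants}
        if norm in accepted:
            return canonical
    return norm


def synonym_tolerant_match(pred: str, gold: str,
                           synonyms: Dict[str, Set[str]] = SYNONYMS) -> bool:
    """True when prediction and gold label share a canonical form."""
    return canonicalize(pred, synonyms) == canonicalize(gold, synonyms)


def accuracy(preds: List[str], golds: List[str],
             synonyms: Dict[str, Set[str]] = SYNONYMS) -> float:
    """Fraction of items whose prediction matches the gold label."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(synonym_tolerant_match(p, g, synonyms)
               for p, g in zip(preds, golds))
    return hits / len(golds)
```

Under this assumed scheme, "Deficiency of Spleen Qi" and "spleen-qi deficiency" count as the same label, while labels absent from the table fall back to a comparison of their normalized strings.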

Taxonomy

Topics: Traditional Chinese Medicine Studies · Machine Learning in Healthcare · Biomedical Text Mining and Ontologies