A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models

Kunning Li; Jianbin Guo; Zhaoyang Shang; Yiqing Liu; Hongmin Du; Lingling Liu; Yuping Zhao; Lifeng Dong

arXiv:2512.02816·cs.CL·December 3, 2025

TL;DR

This paper introduces TCM-BEST4SDT, a comprehensive benchmark dataset for evaluating large language models' capabilities in Traditional Chinese Medicine, focusing on syndrome differentiation and treatment decision-making.

Contribution

It presents a novel, expert-annotated benchmark with multiple evaluation mechanisms specifically designed for TCM applications of LLMs.

Findings

01. Effective evaluation of 15 mainstream LLMs on TCM tasks.

02. Benchmark covers knowledge, ethics, safety, and syndrome differentiation.

03. Public release of the TCM-BEST4SDT dataset for future research.

Abstract

The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such evaluations are challenged by the individualized, holistic, and diverse nature of TCM's "Syndrome Differentiation and Treatment" (SDT). Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation, often neglecting assessment of treatment decision-making. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, and a specialized reward model employed to quantify prescription-syndrome congruence. Data annotation follows a rigorous pipeline. This benchmark, designated TCM-BEST4SDT, encompasses four tasks: TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT. The evaluation framework integrates…

Taxonomy

Topics: Traditional Chinese Medicine Studies · Machine Learning in Healthcare · Topic Modeling