# Tuning and clinical application of large language models in Traditional Chinese Medicine: scoping review

**Authors:** Changxiao Han, Guangyi Yang, Hongtao Li, Liguo Zhu, Minshan Feng

PMC · DOI: 10.1186/s13020-026-01346-8 · Chinese Medicine · 2026-02-19

## TL;DR

This review explores how large language models are being adapted for Traditional Chinese Medicine, focusing on their tuning methods, data use, and clinical applications.

## Contribution

The study provides a systematic scoping review of LLMs in TCM, highlighting tuning techniques and evaluation methods specific to this domain.

## Key findings

- LoRA fine-tuning is the most common technique for adapting LLMs to TCM, used in 65.2% of studies.
- Most studies (87.0%) combine multiple tuning techniques, and training data mix theoretical knowledge (classics) with clinical experience (case records).
- Current models struggle with simulating complex TCM reasoning and individualized diagnosis.
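
For readers unfamiliar with LoRA, the sketch below illustrates the general idea behind the technique: the pretrained weight matrix stays frozen and only a low-rank update is trained. It is not taken from the reviewed paper or any of the included studies. The matrix sizes, the `alpha` value, and the helper names `lora_effective_weight` and `trainable_params` are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight a LoRA-adapted layer applies.

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)  # rank of the update
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Parameter counts as (full fine-tuning, LoRA adapter)."""
    return d_out * d_in, r * (d_out + d_in)


# Illustrative sizes only (hypothetical, far smaller than a real LLM layer).
d_out, d_in, r, alpha = 8, 6, 2, 4.0
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
# B starts at zero, so the adapter leaves the model unchanged before training.
B = [[0.0] * r for _ in range(d_out)]

W_eff = lora_effective_weight(W, A, B, alpha)
full, lora = trainable_params(d_out, d_in, r)
```

With these toy sizes the adapter trains 28 parameters instead of 48. For a hypothetical 4096 × 4096 layer at rank 8, the same formula gives 65,536 trainable parameters against 16,777,216 for full fine-tuning, which is one plausible reason the technique suits domains with limited compute and data.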

## Abstract

Large Language Models (LLMs) show significant potential in healthcare, but their application in Traditional Chinese Medicine (TCM) lacks systematic evaluation. This study aims to comprehensively review LLM tuning techniques, data construction strategies, evaluation methods, and application scenarios in TCM clinical practice.

A scoping review following PRISMA-ScR guidelines was conducted. Researchers systematically searched seven databases for relevant studies published between database inception and May 2025. The analysis focused on identifying model characteristics, tuning techniques, data sources, evaluation methods, application domains, and performance limitations to assess the current state and future directions of TCM-oriented LLMs.

We included 27 studies (21 in English, 6 in Chinese). Application domains comprised TCM knowledge consultation (10 studies) and diagnostic assistance (13 studies), with 4 studies establishing TCM LLM evaluation benchmarks. LoRA fine-tuning was most widely used (65.2%), often combined with prompt engineering (47.8%), continued pre-training (43.5%), and retrieval-augmented generation (39.1%). Most studies (87.0%) employed multiple technique combinations. Training data balanced theoretical knowledge (classics) with clinical experience (case records), though multimodal data remained severely insufficient. Evaluation methods were multidimensional, with human assessment (77.8%) and accuracy (63.0%) most frequently used. Specialized TCM evaluation benchmarks are gradually being established. Current models excel at integrating heterogeneous knowledge, basic syndrome differentiation reasoning, and cross-language knowledge conversion, but show limitations in simulating complex TCM reasoning processes and individualized diagnosis.

Although TCM-oriented LLMs demonstrate effectiveness in knowledge consultation and diagnostic tasks, they face significant challenges in capturing TCM's holistic paradigm, ensuring data quality, and conducting clinical evaluation. Future research should develop TCM-compatible model architectures, build standardized multimodal data ecosystems, strengthen clinical translation, and create evaluation frameworks that reflect TCM's diagnostic process.

The online version contains supplementary material available at 10.1186/s13020-026-01346-8.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12922203/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12922203/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12922203/full.md

---
Source: https://tomesphere.com/paper/PMC12922203