# Reinforcement learning for LLM-based explainable TCM prescription recommendation with implicit preferences from small language models

**Authors:** Xinyu Wang, Xiaohe Sun, Lei Yang, Yitong Zhang, Tao Yang, Jiadong Xie, Kongfa Hu

PMC · DOI: 10.1186/s13020-025-01250-7 · 2025-11-19

## TL;DR

This paper introduces a two-stage framework using reinforcement learning and knowledge distillation to improve the accuracy and explainability of Traditional Chinese Medicine prescription recommendations.

## Contribution

A novel two-stage training framework combining knowledge distillation and implicit preference-driven reinforcement learning for explainable TCM prescription recommendation.

## Key findings

- The model achieves a P@30 of 35.62% and F1@30 of 37.36%, outperforming existing baselines.
- Knowledge distillation improves generalization and explainability, while reinforcement learning enhances F1@30 by 2.01%.

## Abstract

To improve both the interpretability and accuracy of prescription recommendation in Traditional Chinese Medicine (TCM), this study proposes a two-stage training framework that combines knowledge distillation from a teacher model with implicit preference-driven reinforcement learning based on a small language model.

First, GPT-4o is used to parse structured TCM clinical records and create high-quality distillation samples. These samples guide Low-Rank Adaptation (LoRA)-based fine-tuning of the Qwen2.5-7B model, enabling it to generate explainable outputs in the format "symptom analysis—prescription recommendation—prescription explanation". Second, a lightweight BART (Bidirectional and Auto-Regressive Transformers) model is trained to learn the mapping from symptoms to prescriptions. Its outputs are compared with those of the large model to construct preference pairs, which are then used in Direct Preference Optimization (DPO)-based reinforcement tuning to further align the model with potentially better recommendations.
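The second stage can be illustrated with a minimal sketch. The abstract does not say how the two models' outputs are compared, so the rule below is an assumption: each herb set is scored by F1 against the recorded prescription, and the higher-scoring set becomes the "chosen" response. The function names (`herb_f1`, `build_preference_pair`, `dpo_loss`), the placeholder herb names, and `beta=0.1` are hypothetical. Only the loss itself follows the standard DPO objective, here computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def herb_f1(predicted, reference):
    """F1 overlap between a predicted herb set and the recorded prescription."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_preference_pair(llm_herbs, slm_herbs, reference):
    """Assumed rule: the output closer to the recorded prescription is 'chosen'.

    Returns None on a tie, because a tie carries no preference signal.
    """
    llm_score = herb_f1(llm_herbs, reference)
    slm_score = herb_f1(slm_herbs, reference)
    if llm_score == slm_score:
        return None
    if llm_score > slm_score:
        return {"chosen": llm_herbs, "rejected": slm_herbs}
    return {"chosen": slm_herbs, "rejected": llm_herbs}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    margin = beta * [(policy - reference log-ratio of chosen)
                     - (policy - reference log-ratio of rejected)]
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    reference = ["herb_b", "herb_c"]          # placeholder herb names
    pair = build_preference_pair(
        llm_herbs=["herb_a", "herb_b"],
        slm_herbs=["herb_b", "herb_c"],
        reference=reference,
    )
    print(pair)
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

In this sketch the loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, which is the sense in which DPO aligns the model with the preferred recommendation.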

The proposed model achieves a P@30 of 35.62% and F1@30 of 37.36%, outperforming existing baselines. Knowledge distillation improves the model's generalization and explainability, while implicit preference-based reinforcement further raises F1@30 by 2.01%. Overall, the model performs better in both accuracy and explainability.
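For readers unfamiliar with the metrics, the sketch below shows one common way to compute precision, recall, and F1 at a cutoff k for a single case. It is an assumption that the paper uses this exact definition: the abstract does not state the denominator convention or how scores are averaged over cases. Here precision divides by k, recall divides by the size of the recorded prescription, and the function name `precision_recall_f1_at_k` is hypothetical.

```python
def precision_recall_f1_at_k(ranked_herbs, reference_herbs, k):
    """Compute P@k, R@k, and F1@k for one case.

    ranked_herbs: model output, ordered from most to least confident.
    reference_herbs: herbs in the recorded prescription.
    Assumed convention: precision uses k as its denominator.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = set(ranked_herbs[:k])             # de-duplicate the top-k list
    reference = set(reference_herbs)
    hits = len(top_k & reference)
    precision = hits / k
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Placeholder herb names; a real evaluation would average over all cases.
    p, r, f1 = precision_recall_f1_at_k(
        ranked_herbs=["herb_a", "herb_b", "herb_c", "herb_d"],
        reference_herbs=["herb_a", "herb_c", "herb_e"],
        k=4,
    )
    print(f"P@4={p:.4f} R@4={r:.4f} F1@4={f1:.4f}")
```

Under this convention, a corpus-level score such as P@30 would be the mean of the per-case values with k set to 30.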

The proposed approach not only improves the quality and transparency of TCM prescription recommendations, but also offers a promising strategy for building trustworthy and clinically applicable intelligent TCM decision-support systems.

## Full-text entities

- **Genes:** GGTLC5P (gamma-glutamyltransferase light chain 5 pseudogene) [NCBI Gene 653590] {aka GGT}, SLC17A5 (solute carrier family 17 member 5) [NCBI Gene 26503] {aka AST, ISSD, NSD, SD, SIALIN, SIASD}
- **Diseases:** hepatic and gallbladder dysfunction (MESH:D005705), liver pain (MESH:D017093), abdominal distension (MESH:D000007), soreness (MESH:D063806), LLMs (MESH:D007806), dry bowel (MESH:D015352), TCM (MESH:C562377), liver-kidney yin deficiency (MESH:D016710), CoT (MESH:D007161), Stasis (MESH:D014647), MACCTM (MESH:D041781), hallucination (MESH:D006212), cancer (MESH:D009369), dry mouth (MESH:D014987), Symptom (MESH:D012816), abnormalities of the liver (MESH:D008107), hepatic (MESH:D056486), hypochondriac (MESH:D006998), Pain (MESH:D010146)
- **Chemicals:** DPO (-), water (MESH:D014867)
- **Species:** Sedum sarmentosum (species) [taxon 91146], Taraxacum officinale (dandelion, species) [taxon 50225], Bambusa tuldoides (species) [taxon 318046], Magnolia (genus) [taxon 3402], Salvia miltiorrhiza (Chinese salvia, species) [taxon 226208], Sophora flavescens (species) [taxon 49840], Homo sapiens (human, species) [taxon 9606], Coptis (genus) [taxon 3441], Emblica urinaria (ye xia zhu, species) [taxon 296035]

## Figures

19 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12628926/full.md

---
Source: https://tomesphere.com/paper/PMC12628926