# Medical knowledge representation enhancement in large language models through clinical tokens optimization

**Authors:** Qianqian Li, Jijun Tong, Shanna Liu, Chang Li, Jie Tang, Qingli Zhou

PMC · DOI: 10.1038/s41598-026-37438-6 · Scientific Reports · 2026-01-29

## TL;DR

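This paper augments the LLaMA2 tokenizer vocabulary with "clinical tokens" (medical subword units built with BPE) so that medical terms stay whole wherever feasible. After fine-tuning on curated medical datasets, the authors report more efficient encoding and decoding, a longer effective context window, and better performance on downstream medical tasks.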

## Contribution

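The paper introduces "clinical tokens", medical subword units added to the original LLaMA2 tokenizer's vocabulary through a BPE-built domain vocabulary, and compares the resulting tokenizer with the original LLaMA2 tokenizer and the Chinese-LLaMA2 tokenizer.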

## Key findings

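- In a comparison with the original LLaMA2 and Chinese-LLaMA2 tokenizers, the clinical token-augmented tokenizer improves encoding and decoding efficiency.
- The new tokenizer extends the model's effective context window.
- Models fine-tuned on curated medical datasets with the new tokenizer perform better on downstream medical tasks.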
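
The link between the first two findings is simple arithmetic: a context window is measured in tokens, so a tokenizer that emits fewer tokens per character lets more text fit in the same window. Below is a minimal sketch of that relationship. The two stand-in tokenizers (`char_tokenizer`, `word_tokenizer`), the sample terms, and the 4096-token window are illustrative assumptions, not the paper's tokenizers or measurements.

```python
def tokens_per_char(texts, tokenize):
    """Average number of tokens emitted per input character."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_tokens / total_chars


def effective_context_chars(window_tokens, tpc):
    """Approximate number of characters that fit in a fixed token window."""
    return window_tokens / tpc


# Hypothetical stand-ins: a fine-grained tokenizer that splits every
# character, and a coarse one that keeps each whitespace-separated term whole.
def char_tokenizer(text):
    return list(text)


def word_tokenizer(text):
    return text.split()


sample = ["ischemic heart disease", "renal insufficiency"]
window = 4096  # illustrative window size, in tokens

fine = tokens_per_char(sample, char_tokenizer)
coarse = tokens_per_char(sample, word_tokenizer)

print(f"fine-grained:   {effective_context_chars(window, fine):.0f} chars per window")
print(f"coarse-grained: {effective_context_chars(window, coarse):.0f} chars per window")
```

Real subword tokenizers sit between these two extremes, but the direction of the effect is the same: fewer tokens per term means more text per window.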

## Abstract

During the training of medical large language models (LLMs), conventional tokenizers frequently segment domain-specific medical terms into multiple subword tokens, resulting in suboptimal recognition and representation of specialized vocabulary. As a consequence, the model encounters difficulties in effectively acquiring medical domain knowledge during the fine-tuning process. To address this limitation, the present study introduces “clinical tokens”—medical subword units—by augmenting the vocabulary of the original LLaMA2 tokenizer. This adapted tokenizer retains medical terms as whole tokens wherever feasible, thereby enhancing tokenization accuracy and enabling the model to learn and interpret medical knowledge more effectively. For downstream task adaptation, this study employs the Byte Pair Encoding (BPE) algorithm to construct a domain-specific vocabulary and tokenization model, ensuring the inclusion of medical subword units (clinical tokens). We compare the tokenization performance of three variants: the original LLaMA2 tokenizer, the Chinese-LLaMA2 tokenizer (expanded with an extended Chinese vocabulary), and the clinical token-augmented tokenizer. This was followed by fine-tuning the large language models on curated medical datasets. The experimental results indicate that the enhanced tokenizer improves encoding and decoding efficiency, extends the model’s effective context window, and yields superior performance on downstream medical tasks.
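
To illustrate how BPE can come to keep a frequent medical term as a single token, here is a toy, pure-Python sketch of merge learning. It is not the authors' pipeline: the tiny corpus, the function names (`learn_bpe_merges`, `merge_pair`, `encode`), and the merge budget are all assumptions made for illustration, and a real setup would train on a large medical corpus and then add the resulting units to the base tokenizer's vocabulary.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single characters."""
    vocab = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken lexicographically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy "medical corpus" (assumed data, not from the paper).
corpus = ["hypertension"] * 5 + ["hypotension"] * 3 + ["tension"] * 2
merges = learn_bpe_merges(corpus, num_merges=50)

print(encode("hypertension", []))      # no merges: one token per character
print(encode("hypertension", merges))  # frequent term after merge learning
print(encode("nephritis", merges))     # term absent from the toy corpus
```

With this merge budget the frequent term collapses into one token, while a term the corpus never contained stays split into many pieces, which is the fragmentation the abstract describes for general-purpose tokenizers.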

## Full-text entities

- **Diseases:** cardiovascular conditions (MESH:D002318), ischemic heart disease (MESH:D017202), myocardial hypertrophy (MESH:D006984), cerebral infarction (MESH:D002544), dizziness (MESH:D004244), cramps (MESH:D009120), spasms (MESH:D013035), dysentery (MESH:D004403), blood deficiency (MESH:D006402), hypertension (MESH:D006973), cerebral hemorrhage (MESH:D002543), menstrual disorders (MESH:D004412), intestinal cramps (MESH:D007410), abdominal discomfort (MESH:D000007), hemiplegia (MESH:D006429), renal arteriosclerosis (MESH:D001161), renal insufficiency (MESH:D051437), weakness (MESH:D018908), diabetes mellitus (MESH:D003920), cardiac insufficiency (MESH:D000309), edema (MESH:D004487), gastrointestinal inflammation (MESH:D007249), hyperlipidemia (MESH:D006949), hyperglycemia (MESH:D006943), pain (MESH:D010146), coronary heart disease (MESH:D003327), hypotensive (MESH:D007022), kidney yang deficiency (MESH:D007680), lower back pain (MESH:D017116), OOV (MESH:D000070591), LLMs (MESH:D007806), fatigue (MESH:D005221), diarrhea (MESH:D003967)
- **Chemicals:** BBPE (-), lipid (MESH:D008055)
- **Species:** Atractylodes (genus) [taxon 41485], Homo sapiens (human, species) [taxon 9606], Astragalus (genus) [taxon 20400], Codonopsis pilosula (species) [taxon 86864], Angelica (genus) [taxon 40948], Glycyrrhiza (licorice, genus) [taxon 46347], Poria (genus) [taxon 87367], Panax ginseng (Asiatic ginseng, species) [taxon 4054], Zingiber officinale (ginger, species) [taxon 94328]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12910058/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12910058/full.md

## References

13 references — full list in the complete paper: https://tomesphere.com/paper/PMC12910058/full.md

---
Source: https://tomesphere.com/paper/PMC12910058