# Application of a Large Visual Language Model on Tongue Image Description Generation and Physical Constitution Reasoning in Traditional Chinese Medicine (TongueVLM): Model Development and Validation Study

**Authors:** Chengdong Peng, Jun Gao, Nuo Yang, Yong Wang, Renming Chen, Changwu Dong

PMC · DOI: 10.2196/87237 · JMIR Medical Informatics · 2026-03-12

## TL;DR

This paper introduces TongueVLM, a specialized AI model for Traditional Chinese Medicine that generates tongue image descriptions and reasons about physical constitution.

## Contribution

The contribution is a domain-specific multimodal model for TCM that outperforms general-purpose vision-language models in tongue image analysis and constitution reasoning.

## Key findings

- TongueVLM achieved 79.8%, 78.6%, and 60.7% accuracy on three TCM-related tasks.
- Its reported accuracy was 9.1%, 8.4%, and 1.1% higher than LLaVA-OneVision and 7.5%, 7%, and 5.9% higher than Qwen2.5-VL-7B on the same tasks.
- The model generates text at a rate of about 24 tokens per second.

## Abstract

In traditional Chinese medicine (TCM), recognizing a patient’s physical constitution from tongue images involves collecting clinical information, reasoning, and combining tongue image features with patient questioning. Simulating how TCM practitioners recognize pathological information in tongue images and conduct professional dialogue based on tongue image features would support the development of an intelligent interactive system for TCM diagnosis.

This study aimed to develop and validate a domain-specific (vertical) model for TCM that can understand and reason about tongue images.

The TongueVLM multimodal large model comprises a visual encoder module, a modal fusion module, and a language decoder module. First, a visual encoder based on the pretrained CLIP-ViT (Contrastive Language-Image Pre-Training With Vision Transformer) model performs image patching, dimensionality reduction, and transfer learning, mapping high-dimensional tongue features into low-dimensional language encoding vectors. Next, a modal fusion module with a residual architecture maps the visual features into the natural language word embedding space, aligning visual encodings with TCM terminology. Finally, visual instruction fine-tuning is performed on LLaMA (Large Language Model Meta AI), yielding a TCM-domain large language model with 7B parameters.
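The modal fusion step described above can be illustrated with a minimal sketch. It assumes one plausible form of a residual projector: a linear map from the vision dimension to the text embedding dimension, plus a small nonlinear branch added back onto it. The abstract does not specify the layer layout, so the class name `ResidualProjector`, the toy dimensions, the GELU activation, and the random initialization are all hypothetical and are not the authors' implementation.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def init_linear(rng, out_dim, in_dim):
    """Random weights (illustrative only) and zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class ResidualProjector:
    """Hypothetical residual projector: patch features -> word embedding space."""

    def __init__(self, vision_dim, text_dim, seed=0):
        rng = random.Random(seed)
        self.proj = init_linear(rng, text_dim, vision_dim)  # vision_dim -> text_dim
        self.mlp = init_linear(rng, text_dim, text_dim)      # residual branch

    def __call__(self, patches):
        tokens = []
        for patch in patches:
            h = linear(patch, *self.proj)                     # projected patch feature
            r = linear([gelu(v) for v in h], *self.mlp)       # nonlinear refinement
            tokens.append([a + b for a, b in zip(h, r)])      # skip connection: h + r
        return tokens


# Toy usage: 3 image patches with 8 features each, mapped to 4-dimensional "word" vectors.
projector = ResidualProjector(vision_dim=8, text_dim=4)
patches = [[0.1 * (i + j) for j in range(8)] for i in range(3)]
tokens = projector(patches)
print(len(tokens), len(tokens[0]))
```

In a real system the dimensions would match the encoder and decoder (for example, the CLIP-ViT output width and the LLaMA embedding width), and the resulting visual tokens would be concatenated with the text token embeddings before decoding.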

The constructed multimodal dataset includes 3 test datasets, and experiments were conducted using 3000 samples from each. The results indicate that TongueVLM outperforms general-purpose large models on all 3 tasks. On the multimodal test datasets, TongueVLM achieved accuracy rates of 79.8%, 78.6%, and 60.7% on the 3 evaluation tasks, respectively. These rates are 9.1%, 8.4%, and 1.1% higher than those of LLaVA-OneVision and 7.5%, 7%, and 5.9% higher than those of Qwen2.5-VL-7B. The text generation speed was approximately 24 tokens per second.
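As a quick arithmetic check on the comparison above, the sketch below derives the baseline accuracies implied by the reported gaps. It assumes the gaps are absolute differences in percentage points, which the abstract does not state explicitly; if they are relative improvements, the implied values would differ. The variable and function names are illustrative.

```python
# Accuracies reported for TongueVLM on the 3 evaluation tasks (percent).
tongue_vlm = [79.8, 78.6, 60.7]

# Reported accuracy gaps versus each baseline (assumed to be percentage points).
reported_gaps = {
    "LLaVA-OneVision": [9.1, 8.4, 1.1],
    "Qwen2.5-VL-7B": [7.5, 7.0, 5.9],
}


def implied_baseline(model_acc, gaps):
    """Subtract each gap from the model accuracy, rounded to one decimal place."""
    return [round(acc - gap, 1) for acc, gap in zip(model_acc, gaps)]


implied = {name: implied_baseline(tongue_vlm, gaps) for name, gaps in reported_gaps.items()}
for name, accs in implied.items():
    print(name, accs)
```

Under that assumption, the smallest gap is on the third task against LLaVA-OneVision, where the two models differ by about one point.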

The TongueVLM model performs tongue image description generation and physical constitution reasoning in TCM and is suitable for use in intelligent TCM diagnosis systems.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13022551/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13022551/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/PMC13022551/full.md

---
Source: https://tomesphere.com/paper/PMC13022551