# Cross-modal attention model integrating tongue images and descriptions: a novel intelligent TCM approach for pathological organ diagnosis

**Authors:** Quan Gan, Chen Wang, Zhaoman Zhong, Jiaying Wu, Qiwei Ge, Lei Shi, Jiaqing Shang, Chuanxia Liu

PMC · DOI: 10.3389/fphys.2025.1580985 · Frontiers in Physiology · 2025-04-23

## TL;DR

This paper introduces a new AI model that combines tongue images and descriptions to improve traditional Chinese medicine diagnosis of organ conditions.

## Contribution

The paper proposes a Cross-Modal Pathological Organ Diagnosis Model that fuses features from tongue images and their textual descriptions through a cross-modal attention mechanism to classify pathological organs in TCM.

## Key findings

- On the authors' self-constructed dataset, the proposed model outperforms commonly used models in overall accuracy.
- Multimodal fusion significantly improves performance compared to using images or text alone.

## Abstract

Tongue diagnosis is a fundamental technique in traditional Chinese medicine (TCM), where clinicians evaluate the tongue’s appearance to infer the condition of pathological organs. However, most existing research on intelligent tongue diagnosis focuses on analyzing tongue images and often neglects the descriptive text that accompanies them, even though this text is an essential component of clinical diagnosis. To close this gap, we propose a novel Cross-Modal Pathological Organ Diagnosis Model that integrates tongue images and textual descriptions for more accurate pathological classification.

Our model extracts features from both the tongue images and the corresponding textual descriptions. These features are then fused using a cross-modal attention mechanism to enhance the classification of pathological organs. The cross-modal attention mechanism enables the model to effectively combine visual and textual information, addressing the limitations of using either modality alone.
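The fusion step described above can be illustrated with a minimal sketch. This summary does not give the architecture's details, so everything here is an assumption made for illustration: text-token features act as queries, image-patch features act as keys and values, there are no learned projection matrices, and the names `cross_attention`, `mean_pool`, and `fuse` are hypothetical. It shows generic scaled dot-product cross-attention, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    attended, weights = [], []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        attended.append(
            [sum(w[j] * values[j][d] for j in range(len(values))) for d in range(dim)]
        )
    return attended, weights

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def fuse(text_feats, image_feats):
    """Assumed fusion: text tokens query image patches, then pooled
    text features and pooled attended features are concatenated."""
    attended, weights = cross_attention(text_feats, image_feats, image_feats)
    fused = mean_pool(text_feats) + mean_pool(attended)
    return fused, weights

# Toy features (made up): 2 text tokens and 3 image patches, dimension 4.
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_feats = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
fused, weights = fuse(text_feats, image_feats)
print(len(fused))   # length of the concatenated feature vector
print(weights[0])   # attention of the first text token over the patches
```

In a full model the fused vector would feed a classifier over the pathological organ labels, and the queries, keys, and values would pass through learned linear projections first. Those parts are omitted here to keep the attention step visible.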

We conducted experiments on a self-constructed dataset to evaluate our model’s performance. The results demonstrate that our model outperforms commonly used models in overall accuracy. Additionally, ablation studies in which either tongue images or textual descriptions were used alone confirmed the significant benefit of multimodal fusion in improving diagnostic accuracy.

This study introduces a new perspective on intelligent tongue diagnosis in TCM by incorporating both visual and textual data. The experimental findings highlight the importance of cross-modal feature fusion for improving the accuracy of pathological diagnosis. Our approach contributes to the development of more effective diagnostic systems and paves the way for future advances in the automation of TCM diagnosis.

## Full-text entities

- **Genes:** VIT (vitrin) [NCBI Gene 5212] {aka VIT1}
- **Diseases:** liver problems (MESH:D017093), heart disease (MESH:D006331), kidney-related diseases (MESH:D007674), liver cancer (MESH:D006528), verrucous gastritis (MESH:D005756), mushroom papillae hyperplasia (MESH:D006965), Qi deficiency (MESH:D007153), pulmonary/pleural tuberculosis (MESH:D014396), Yin deficiency (MESH:D016710), chronic glomerulonephritis (MESH:D005921), peptic ulcer (MESH:D010437), COVID-19 (MESH:D000086382), gastrointestinal diseases (MESH:D005767), nephritis (MESH:D009393), pneumonia (MESH:D011014), lesions (MESH:D009059), visceral lesions (MESH:D007418), coronary heart disease (MESH:D003327), CMPOD (MESH:D001523), Alzheimer's Disease (MESH:D000544)
- **Chemicals:** CMPOD (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12059375/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12059375/full.md

## References

41 references — full list in the complete paper: https://tomesphere.com/paper/PMC12059375/full.md

---
Source: https://tomesphere.com/paper/PMC12059375