# MedFusionT5: Cross-Modal Attention Boosts Semantic Quality and Reduces Hallucinations in Dental AI

**Authors:** Hamida Abdaoui, Sabri Barbaria, Ismail Dergaa, Halil İbrahim Ceylan, Nicola Luigi Bragazzi, Andrea de Giorgio, Ridha Ben Salah, Hanene Boussi Rahmouni

PMC · DOI: 10.1016/j.identj.2025.109404 · 2026-03-01

## TL;DR

MedFusionT5 generates dental radiograph reports using unidirectional cross-modal attention between clinical text and visual patches, raising semantic quality (CIDEr 5.65) and lowering the hallucination rate to 2.42%.

## Contribution

Introduces MedFusionT5, a unidirectional cross-modal alignment framework that reduces hallucinations in dental AI reports.

## Key findings

- MedFusionT5 outperformed both baselines on CIDEr: 122% higher than coattention (5.65 vs 2.54) and 320% higher than simple concatenation.
- Achieved a 2.42% hallucination rate, a 39% reduction compared to the coattention baseline and 46% compared to concatenation.
- Achieved high precision (0.982) and recall (0.923), with report quality essentially independent of report length (r = −0.022).

## Abstract

Automated dental report generation faces significant challenges in multimodal fusion, often resulting in suboptimal semantic quality and risks of hallucination, where AI generates clinically unsupported content. Current approaches that rely on simple feature concatenation or bidirectional attention mechanisms fail to effectively capture visual-textual relationships in medical imaging. This study aims to develop MedFusionT5, a unidirectional cross-modal alignment framework that (1) achieves superior clinical report quality through focused attention between visual patches and clinical text representations, and (2) ensures exceptional factual consistency by minimising hallucination rates.

We implemented a novel architecture that integrates a vision transformer (ViT) for patch-based visual feature extraction with Bio_ClinicalBERT for clinical text encoding. The core innovation is a unidirectional multihead attention alignment module that selectively maps textual embeddings to relevant visual patches before multimodal fusion. A T5-base decoder then generates diagnostic reports from the aligned representations. We evaluated performance on 700 dental panoramic radiographs using comprehensive metrics, including BLEU, ROUGE, CIDEr, clinical precision/recall, and specialised hallucination analysis, comparing against both concatenation and coattention baselines.
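
To make the alignment step concrete, here is a minimal, dependency-free sketch of unidirectional cross-attention. It is not the authors' implementation. It assumes that text embeddings act as queries and visual patch embeddings act as keys and values, uses a single head with no learned projections (the paper describes a multihead module), and feeds in random vectors where ViT and Bio_ClinicalBERT outputs would go. The name `cross_attention` and all dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, visual_patches):
    """Single-head scaled dot-product attention, text -> image only.

    Assumption: each text token is a query; visual patches serve as both
    keys and values. No image -> text pass is computed, which is what
    makes this sketch unidirectional.
    """
    dim = len(visual_patches[0])
    scale = math.sqrt(dim)
    aligned, weights = [], []
    for query in text_tokens:
        # Similarity of this text token to every visual patch.
        scores = [
            sum(q * k for q, k in zip(query, patch)) / scale
            for patch in visual_patches
        ]
        attn = softmax(scores)
        # Weighted sum of patch vectors: the visually grounded token.
        context = [
            sum(w * patch[c] for w, patch in zip(attn, visual_patches))
            for c in range(dim)
        ]
        aligned.append(context)
        weights.append(attn)
    return aligned, weights


# Stand-in embeddings (hypothetical sizes): 4 text tokens, 16 patches, dim 8.
random.seed(0)
text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
patches = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]

aligned, weights = cross_attention(text, patches)
print(len(aligned), len(aligned[0]))  # 4 8
```

Each output row is a text token re-expressed as a mixture of visual patches, so the result keeps the text sequence length. A coattention variant would add a second pass in which patches query the text; in this sketch that pass is simply absent.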

MedFusionT5 demonstrated superior performance across all evaluated metrics. Compared to the coattention baseline, CIDEr increased by 122% (5.65 vs 2.54) and by 320% over simple concatenation. BLEU-4 reached 0.865, outperforming both baselines, while maintaining the lowest hallucination rate at 2.42% (39% reduction vs coattention, 46% vs concatenation). The model achieved an optimal balance between precision (0.982) and recall (0.923), with 90% of reports exhibiting near-zero hallucination. Notably, MedFusionT5 showed consistent quality independent of report length (r = −0.022), unlike coattention's length-dependent performance (r = +0.795).
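
The relative figures above follow from simple arithmetic, sketched here for readers who want to check them. `relative_change_pct` reproduces the 122% CIDEr gain from the two reported scores, and `pearson_r` shows how a length-versus-quality correlation of the kind reported (r = −0.022, r = +0.795) can be computed. Both function names are hypothetical, and the lengths and scores passed to `pearson_r` are toy values, not data from the paper.

```python
import math


def relative_change_pct(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# CIDEr: MedFusionT5 (5.65) vs the coattention baseline (2.54), as reported.
print(f"{relative_change_pct(5.65, 2.54):.0f}%")  # 122%

# Toy data only: report lengths (tokens) and a per-report quality score.
toy_lengths = [20, 35, 50, 65, 80]
toy_scores = [0.80, 0.90, 0.70, 0.85, 0.75]
print(pearson_r(toy_lengths, toy_scores))
```

A coefficient near zero, as reported for MedFusionT5, means report length carries almost no linear information about quality, whereas a value near +0.8 means quality rises with length.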

MedFusionT5 establishes a new state-of-the-art in automated dental report generation, demonstrating that unidirectional cross-modal alignment achieves superior semantic quality and clinical precision while minimising hallucinations. This work identifies unidirectional attention as the optimal alignment strategy for medical AI, providing a foundation for trustworthy clinical deployment where both accuracy and reliability are paramount.

## Full-text entities

- **Diseases:** dental injuries (MESH:D009057), athletic injuries (MESH:D001265), hallucination (MESH:D006212), trauma (MESH:D014947), temporomandibular joint disorders (MESH:D013705), mandibular fractures (MESH:D008337), dental emergencies (MESH:D004630), periapical infections (MESH:D010483), LLMs (MESH:D007806)
- **Chemicals:** MedFusionT5 (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12966729/full.md

---
Source: https://tomesphere.com/paper/PMC12966729