Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning

Yogesh Thakku Suresh; Vishwajeet Shivaji Hogale; Luca-Alexandru Zamfira; Anandavardhana Hegde

arXiv:2510.25164·eess.IV·November 3, 2025

TL;DR

This paper introduces a transformer-based multimodal framework for generating accurate, clinically relevant captions for MRI scans by aligning image and text embeddings, improving medical image reporting.

Contribution

It presents a hybrid architecture that pairs a DEiT-Small vision transformer image encoder and MediCareBERT caption embeddings with a custom LSTM-based decoder, and uses a hybrid cosine-MSE loss to align image and text embeddings semantically.

Findings

01

Focusing on filtered brain-only MRIs from the MultiCaRe dataset, rather than general MRI images, improves caption accuracy.

02

Domain-specific data also improves semantic alignment between images and captions.

03

The method is benchmarked against state-of-the-art medical image captioning methods, including BLIP, R2GenGPT, and recent transformer-based approaches.

Abstract

We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using a hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs with performance on general MRI images, and against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
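
The abstract names a hybrid cosine-MSE loss and inference via vector similarity but does not give their exact form. The sketch below shows one plausible reading, in plain Python so it stays dependency-free. The weighting parameter `alpha`, the function names, and the nearest-caption lookup in `retrieve_caption` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_cosine_mse_loss(image_emb, text_emb, alpha=0.5):
    """Weighted sum of a cosine-distance term and a mean-squared-error term.

    alpha is a hypothetical mixing weight; the abstract does not state
    how the two terms are combined.
    """
    cosine_term = 1.0 - cosine_similarity(image_emb, text_emb)
    mse_term = sum((a - b) ** 2 for a, b in zip(image_emb, text_emb)) / len(image_emb)
    return alpha * cosine_term + (1.0 - alpha) * mse_term


def retrieve_caption(image_emb, caption_bank):
    """Return the caption whose embedding is most similar to the image embedding.

    caption_bank is a list of (caption, embedding) pairs. This is one assumed
    interpretation of "inference via vector similarity".
    """
    best_caption, _ = max(
        caption_bank,
        key=lambda item: cosine_similarity(image_emb, item[1]),
    )
    return best_caption
```

The cosine term penalizes a difference in direction between the two embeddings, while the MSE term also penalizes a difference in magnitude, which is one reason such terms are sometimes combined. In practice the embeddings would come from the image encoder and the caption embedding model rather than hand-written lists.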
