Closing the gap in multimodal medical representation alignment

Eleonora Grassucci; Giordano Cicchetti; Danilo Comminiello

arXiv:2602.20046·cs.CV·February 24, 2026

Open Access

TL;DR

This paper investigates the modality gap in multimodal medical representation learning and proposes a framework that improves semantic alignment between medical images and text, with gains on retrieval and captioning tasks.

Contribution

The paper shows that the modality gap is present in medical multimodal data and introduces a modality-agnostic method to close it, improving semantic alignment.

Findings

01 Enhanced cross-modal retrieval accuracy

02 Improved medical image captioning quality

03 Reduced modality gap in medical data representations

Abstract

In multimodal learning, CLIP has emerged as the de facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that harm true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unexplored and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study the phenomenon in the medical setting, revealing that the modality gap is also present in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are more aligned, regardless of their source modality. Our…
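To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch of (a) a CLIP-style symmetric contrastive loss and (b) one common way to quantify the modality gap, namely the distance between the centroids of the normalized image and text embeddings. It is an illustration under stated assumptions, not the framework proposed in the paper: the function names, the temperature value of 0.07, and the synthetic embeddings are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def modality_gap(image_emb, text_emb):
    """Distance between the centroids of the two normalized embedding sets.

    Assumption: this centroid-distance measure is one common proxy for the
    modality gap; the paper may use a different measure.
    """
    img_centroid = l2_normalize(image_emb).mean(axis=0)
    txt_centroid = l2_normalize(text_emb).mean(axis=0)
    return float(np.linalg.norm(img_centroid - txt_centroid))


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature is an illustrative value, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # cosine similarities, scaled

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; the correct class is the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


# Synthetic data: both "modalities" share the same content, but each is
# shifted in opposite directions along one axis to mimic a modality gap.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
offset = np.zeros(16)
offset[0] = 5.0
image_emb = shared + offset
text_emb = shared - offset

gap_before = modality_gap(image_emb, text_emb)  # shifted modalities
gap_after = modality_gap(shared, shared)        # perfectly overlapping
aligned_loss = clip_style_loss(shared, shared)
mismatched_loss = clip_style_loss(shared, np.roll(shared, 1, axis=0))
```

In this toy setup the paired samples carry identical content, yet the two sets sit in different regions of the unit sphere, which is the situation the abstract describes as a modality gap. Removing the offset brings the centroid distance to zero, and the contrastive loss is lower for correctly paired rows than for rows paired with the wrong partner.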

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques