MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

Mai A. Shaaban; Tausifa Jan Saleem; Vijay Ram Papineni; Mohammad Yaqub

arXiv:2506.22900 · cs.CV · November 10, 2025

TL;DR

MOTOR introduces a multimodal retrieval and re-ranking method using grounded captions and optimal transport to improve medical visual question answering accuracy by providing more relevant clinical context.

Contribution

The paper presents a novel multimodal retrieval and re-ranking approach that uses grounded captions and optimal transport to improve the relevance of retrieved context in MedVQA.

Findings

01. Achieves 6.45% higher accuracy than state-of-the-art methods.

02. Outperforms existing retrieval approaches in clinical relevance.

03. Validated by empirical analysis and human expert evaluation.

Abstract

Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the…
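To make the idea of optimal-transport re-ranking concrete, the following is a minimal sketch and not the authors' implementation. It assumes each query and each retrieved candidate is represented by a small set of feature vectors (for example, embeddings of grounded captions or image regions), uses a cosine-distance cost with uniform weights, and solves entropic optimal transport with Sinkhorn iterations. The function names, the cost function, the uniform marginals, and the `eps` and `n_iters` parameters are all illustrative assumptions; the abstract does not specify any of them.

```python
import math


def cosine_cost(x, y):
    """Cost matrix C[i][j] = 1 - cosine similarity between x[i] and y[j]."""
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(a * a for a in v)) or 1.0

    return [[1.0 - sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
             for v in y] for u in x]


def sinkhorn_cost(cost, eps=0.1, n_iters=200):
    """Entropic-regularised OT cost between two uniform distributions.

    Assumption: both feature sets carry uniform mass; a real system
    might weight features by saliency or attention instead.
    """
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    # Gibbs kernel K = exp(-C / eps).
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v); return the inner product <P, C>.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))


def rerank(query_feats, candidates, eps=0.1):
    """Sort candidates by ascending OT cost to the query (lower = closer).

    `candidates` is a hypothetical dict mapping a candidate id to its
    list of feature vectors.
    """
    scored = [(name, sinkhorn_cost(cosine_cost(query_feats, feats), eps))
              for name, feats in candidates.items()]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Toy 2-D features, invented purely for illustration.
    query = [[1.0, 0.0], [0.0, 1.0]]
    candidates = {
        "doc_a": [[0.9, 0.1], [0.1, 0.9]],
        "doc_b": [[-1.0, 0.0], [0.0, -1.0]],
    }
    for name, cost in rerank(query, candidates):
        print(name, round(cost, 4))
```

Using a transport cost rather than a single pooled similarity lets the score reflect how well individual query features can be matched to individual candidate features, which is one plausible reading of why optimal transport suits multimodal re-ranking. A production system would more likely rely on an established solver and learned embeddings than on this hand-rolled loop.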
