Location-Aware Pretraining for Medical Difference Visual Question Answering
Denis Musinguzi, Caren Han, Prasenjit Mitra

TL;DR
This paper introduces a location-aware pretraining framework for medical difference visual question answering, enhancing vision encoders to better detect subtle, clinically relevant changes in medical images.
Contribution
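The authors propose a pretraining approach that combines three tasks, automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF), to learn fine-grained, spatially grounded visual representations for medical difference VQA.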
Findings
Achieves state-of-the-art performance on medical difference VQA tasks.
Improves the detection of subtle disease progression in chest X-ray images.
Enhances reasoning about clinically relevant changes in medical images.
Abstract
Differential medical VQA models compare multiple images to identify clinically meaningful changes and rely on vision encoders to capture fine-grained visual differences that reflect radiologists' comparative diagnostic workflows. However, vision encoders trained using standard contrastive or classification objectives often fail to capture the subtle variations needed to distinguish true disease progression from acquisition-related variability. To address this limitation, we introduce a location-aware pretraining framework that incorporates automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These tasks promote the learning of fine-grained, spatially grounded visual representations. When integrated with a language model, our approach achieves state-of-the-art performance on medical difference VQA by accurately…
