Bi-MCQ: Reformulating Vision-Language Alignment for Negation Understanding
Tae Hun Kim, Hyun Gyu Lee

TL;DR
This paper introduces Bi-MCQ, a novel framework that reformulates vision-language alignment as a conditional semantic comparison task, significantly improving negation understanding in medical image analysis models.
Contribution
It proposes a bi-directional multiple-choice learning approach with direction-specific modules to enhance negation comprehension in vision-language models.
Findings
Up to 0.47 AUC improvement over state-of-the-art models.
Reduces affirmative-negative AUC gap by 0.12 on average.
Enhances negation understanding in medical VLMs.
Abstract
Recent vision-language models (VLMs) achieve strong zero-shot performance via large-scale image-text pretraining and have been widely adopted in medical image analysis. However, existing VLMs remain notably weak at understanding negated clinical statements, largely due to contrastive alignment objectives that treat negation as a minor linguistic variation rather than a meaning-inverting operator. In multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive image-prompt alignments, limiting effective learning of disease absence. To overcome these limitations, we reformulate vision-language alignment as a conditional semantic comparison problem, which is instantiated through a bi-directional multiple-choice learning framework (Bi-MCQ). By jointly training Image-to-Text and Text-to-Image MCQ tasks with affirmative, negative, and mixed prompts, our method…
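To make the multiple-choice framing concrete, here is a minimal sketch of how a bi-directional multiple-choice objective could be scored. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss, so the softmax cross-entropy over candidate similarities, the `temperature` value, the equal weighting of the two directions, and the function names `mcq_loss` and `bi_mcq_loss` are all hypothetical. The direction-specific modules mentioned in the summary are omitted.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mcq_loss(query, candidates, correct_index, temperature=0.1):
    """Cross-entropy over the similarities between one query embedding and
    its candidate options (assumed form; not taken from the paper)."""
    logits = [dot(query, c) / temperature for c in candidates]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[correct_index]


def bi_mcq_loss(image, text_options, text_answer,
                text, image_options, image_answer):
    """Average of an Image-to-Text question (pick the right prompt for an image)
    and a Text-to-Image question (pick the right image for a prompt).
    Equal weighting is an assumption."""
    i2t = mcq_loss(image, text_options, text_answer)
    t2i = mcq_loss(text, image_options, image_answer)
    return 0.5 * (i2t + t2i)


# Toy 2-D embeddings: the options could stand for an affirmative prompt
# and its negated counterpart.
image = [1.0, 0.0]
prompts = [[1.0, 0.0], [0.0, 1.0]]
loss_if_first_is_correct = mcq_loss(image, prompts, 0)
loss_if_second_is_correct = mcq_loss(image, prompts, 1)
```

In this toy setup the loss is small when the option that matches the image is marked correct and large when the mismatched option is marked correct. That is the sense in which a multiple-choice objective forces an explicit comparison between an affirmative statement and its negation.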
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare
