FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention
Peng Zhang, Zhihui Lai, Wenting Chen, Xu Wu, Heng Kong

TL;DR
FaNe is a medical vision-language pre-training framework that reduces false negatives through semantic-aware positive pair mining and improves fine-grained cross-modal alignment with text-conditioned sparse attention, reporting state-of-the-art results on five medical imaging benchmarks.
Contribution
Introduces semantic-aware positive pair mining and text-conditioned sparse attention pooling to improve medical VLP by reducing false negatives and enabling fine-grained image-text alignment.
Findings
Achieves state-of-the-art performance on five medical imaging benchmarks.
Effectively reduces false negatives with adaptive reweighting.
Enhances fine-grained cross-modal alignment through localized visual representations.
Abstract
Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on…
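The false-negative idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the fixed `threshold`, the `temperature` value, and the simple row normalization are all assumptions made for illustration, and the paper's "adaptive normalization" is not specified in this summary. The sketch only shows the general pattern of treating reports with highly similar text embeddings as additional positives instead of negatives in an image-to-text contrastive loss.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def mine_soft_positives(text_emb, threshold=0.9):
    """Build row-stochastic targets from text-text cosine similarity.

    Hypothetical helper: reports whose similarity to report i is at least
    `threshold` (assumed positive) are treated as extra positives for image i.
    """
    t = l2_normalize(text_emb)
    sim = t @ t.T                          # (N, N) cosine similarities
    mask = sim >= threshold
    np.fill_diagonal(mask, True)           # each pair is its own positive
    weights = np.where(mask, sim, 0.0)
    np.fill_diagonal(weights, 1.0)
    # Plain row normalization; a stand-in for the paper's adaptive scheme.
    return weights / weights.sum(axis=1, keepdims=True)


def soft_target_contrastive_loss(img_emb, text_emb, targets, temperature=0.07):
    """Image-to-text cross-entropy against soft (multi-positive) targets."""
    i = l2_normalize(img_emb)
    t = l2_normalize(text_emb)
    logits = (i @ t.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * log_prob).sum(axis=1).mean())


# Toy batch: reports 0 and 1 have identical embeddings, report 2 differs.
text_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
img_emb = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
targets = mine_soft_positives(text_emb)
loss = soft_target_contrastive_loss(img_emb, text_emb, targets)
```

In the toy batch, images 0 and 1 share their target mass across both matching reports, so neither is penalized for scoring highly against the other's report, which is the failure that a one-hot contrastive target would introduce.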
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
