FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention

Peng Zhang; Zhihui Lai; Wenting Chen; Xu Wu; Heng Kong

arXiv:2511.12215·cs.CV·November 18, 2025


TL;DR

FaNe is a medical vision-language pre-training framework that reduces false negatives through semantic-aware positive pair mining and improves fine-grained cross-modal alignment through text-conditioned sparse attention pooling, reaching state-of-the-art results on five medical imaging benchmarks.

Contribution

Introduces semantic-aware positive pair mining and text-conditioned sparse attention pooling to improve medical VLP by reducing false negatives and enabling fine-grained image-text alignment.
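
To make the idea of text-conditioned sparse attention pooling concrete, here is a minimal sketch in plain Python. It assumes that sparsity is obtained by keeping only the `top_k` image patches most similar to a text query and applying a softmax over those patches alone; the function name `sparse_attention_pool`, the top-k rule, and the scaled dot-product score are illustrative assumptions, not the module described in the paper.

```python
import math

def sparse_attention_pool(text_query, patch_feats, top_k=2):
    """Pool patch features into one vector guided by a text query.

    Hypothetical sketch: sparsity is modelled as top-k selection, which is
    an assumption and may differ from the paper's actual mechanism.
    """
    dim = len(text_query)
    # Scaled dot-product score between the text query and every patch.
    scores = [
        sum(q * p for q, p in zip(text_query, patch)) / math.sqrt(dim)
        for patch in patch_feats
    ]
    # Indices of the top_k highest-scoring patches; the rest are dropped.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax restricted to the kept patches (max subtracted for stability).
    max_score = max(scores[i] for i in keep)
    exp_scores = {i: math.exp(scores[i] - max_score) for i in keep}
    total = sum(exp_scores.values())
    weights = [exp_scores[i] / total if i in exp_scores else 0.0
               for i in range(len(scores))]
    # Weighted sum of patch features gives the localized visual representation.
    pooled = [sum(w * patch[d] for w, patch in zip(weights, patch_feats))
              for d in range(dim)]
    return pooled, weights

# Toy example: four 2-D "patches" and one text query.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
pooled, weights = sparse_attention_pool([1.0, 0.0], patches, top_k=2)
print(weights)  # only the two patches closest to the query get non-zero weight
```

With top-k selection, patches unrelated to the text contribute exactly zero to the pooled vector, which is one simple way to obtain a localized representation.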

Findings

01 Achieves state-of-the-art performance on five medical imaging benchmarks.

02 Reduces false negatives through semantic-aware positive pair mining and adaptively reweights semantically similar negatives.

03 Enhances fine-grained cross-modal alignment through localized visual representations.
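
The adaptive reweighting of negatives can be illustrated with a small sketch. It assumes an InfoNCE-style loss in which each negative is weighted in proportion to `exp(beta * similarity)`, with the weights normalized to a mean of 1; the function name `hard_negative_weighted_loss` and the parameters `beta` and `temperature` are hypothetical, and the exact weighting used in the paper may differ.

```python
import math

def hard_negative_weighted_loss(sim_row, pos_index, temperature=0.1, beta=1.0):
    """Contrastive loss for one anchor with up-weighted hard negatives.

    sim_row holds the anchor's similarity to every candidate; pos_index marks
    the positive. The weighting scheme is an illustrative assumption.
    """
    negatives = [j for j in range(len(sim_row)) if j != pos_index]
    # Raw weights grow with similarity, so harder negatives count more.
    raw = [math.exp(beta * sim_row[j]) for j in negatives]
    mean_raw = sum(raw) / len(raw)
    weights = [r / mean_raw for r in raw]  # normalized to mean 1
    pos_term = math.exp(sim_row[pos_index] / temperature)
    neg_term = sum(w * math.exp(sim_row[j] / temperature)
                   for w, j in zip(weights, negatives))
    return -math.log(pos_term / (pos_term + neg_term))

# One anchor: candidate 0 is the positive, candidate 1 is a hard negative.
similarities = [0.9, 0.8, 0.1]
print(hard_negative_weighted_loss(similarities, 0, beta=0.0))  # plain InfoNCE
print(hard_negative_weighted_loss(similarities, 0, beta=5.0))  # hard negative emphasized
```

Setting `beta=0.0` gives every negative weight 1 and recovers the standard InfoNCE loss, so `beta` controls how strongly similar negatives are emphasized.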

Abstract

Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on…
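
As a rough illustration of positive pair mining from text-text similarity, the sketch below marks two reports as mutual positives when the cosine similarity of their embeddings reaches a fixed threshold. The abstract does not specify the adaptive normalization step, so it is omitted here; the fixed `threshold`, the function names, and the toy embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_positive_pairs(text_embs, threshold=0.9):
    """Return, for each sample, the set of indices treated as positives.

    A sample is always its own positive. Any other sample whose report
    embedding is similar enough is added as well, so it is no longer pushed
    away as a (false) negative. The fixed threshold is a simplifying
    assumption standing in for the paper's adaptive normalization.
    """
    n = len(text_embs)
    positives = []
    for i in range(n):
        row = {i}
        for j in range(n):
            if j != i and cosine(text_embs[i], text_embs[j]) >= threshold:
                row.add(j)
        positives.append(row)
    return positives

# Reports 0 and 1 are near-duplicates; report 2 describes something else.
report_embs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(mine_positive_pairs(report_embs))
```

In a standard contrastive setup, reports 0 and 1 would be treated as negatives of each other despite being nearly identical; the mined sets let a loss treat them as positives instead.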


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis