Representation Learning with Semantic-aware Instance and Sparse Token Alignments

Phuoc-Nguyen Bui; Toan Duc Nguyen; Junghyun Bum; Duc-Tai Le; Hyunseung Choo

arXiv:2601.08165 · cs.CV · April 2, 2026

TL;DR

This paper introduces SISTA, a semantic-aware multi-level alignment framework for medical vision-language pre-training that improves representation quality by addressing false negatives and aligning image patches with relevant words.

Contribution

The paper proposes a novel multi-level alignment framework that enhances medical VLP by incorporating semantic correspondence and sparse token alignments, improving downstream task performance.

Findings

1. Improves transfer performance across multiple datasets and tasks.

2. Achieves significant gains in fine-grained tasks with limited labeled data.

3. Effectively reduces false negatives in contrastive learning.

Abstract

Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, images or reports from different patients can be substantially similar. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and degrade the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), which exploits the semantic correspondence between medical images and radiology reports at two levels, i.e., the image-report and patch-word levels. Specifically, we improve the…
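The false-negative problem described in the abstract can be illustrated with a small sketch. The code below implements the conventional symmetric image-report contrastive loss with hard (identity) targets, then shows one hypothetical way to soften those targets using report-to-report similarity so that near-duplicate reports from different patients are not pushed apart as strongly. This is not the SISTA formulation, which the truncated abstract does not specify; the function names `contrastive_loss` and `soft_targets`, the mixing weight `alpha`, and both temperature values are assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def contrastive_loss(img, txt, targets=None, temperature=0.07):
    """Symmetric image-report cross-entropy over a batch of B pairs.

    targets is a (B, B) matrix whose rows sum to 1. When it is None,
    the identity is used: only the paired report counts as a positive
    and every unpaired sample is treated as a negative.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    if targets is None:
        targets = np.eye(len(img))
    i2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    t2i = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (i2t + t2i)


def soft_targets(report_emb, alpha=0.5, temperature=0.1):
    """Hypothetical softened targets (an assumption, not the paper's method).

    Mixes the hard identity labels with a softmax over report-to-report
    cosine similarity, so semantically similar reports share target mass.
    """
    r = l2_normalize(report_emb)
    soft = np.exp(log_softmax(r @ r.T / temperature, axis=1))
    return alpha * np.eye(len(r)) + (1 - alpha) * soft


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 16))
    reports = rng.normal(size=(4, 16))
    reports[1] = reports[0]  # two patients with the same report text
    print("hard targets:", contrastive_loss(images, reports))
    print("soft targets:", contrastive_loss(images, reports,
                                            targets=soft_targets(reports)))
```

With hard targets, the duplicated report in the demo batch is treated as a negative for the first image even though it is semantically identical to the positive. The softened targets assign it the same similarity-based weight as the paired report, which is the kind of correction a semantic-aware objective aims for.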
