AGA: An adaptive group alignment framework for structured medical cross-modal representation learning

Wei Li; Xun Gong; Jiao Li; Xiaobin Sun

arXiv:2507.23402·cs.CV·August 1, 2025

TL;DR

This paper introduces AGA, a novel framework for structured medical image-report representation learning that dynamically groups visual and linguistic features, improving alignment without relying on large negative sample sets.

Contribution

The paper proposes a new adaptive grouping mechanism and an instance-aware alignment loss to better capture structured semantics in medical cross-modal data.

Findings

01. Achieves superior image-text retrieval performance

02. Improves classification accuracy in medical datasets

03. Operates effectively in zero-shot and fine-tuning scenarios

Abstract

Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and…
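The token-to-patch grouping step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the cosine similarity measure, the fixed `k`, the toy embeddings, and the helper names `cosine`, `similarity_matrix`, and `top_k_groups` are all assumptions made for illustration. The abstract calls the grouping adaptive and bidirectional, and it does not state the selection rule in the text available here, so the sketch covers only the one direction the abstract spells out (each token selecting its top-matching patches).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(tokens, patches):
    """Fine-grained similarities: rows are text tokens, columns are image patches."""
    return [[cosine(t, p) for p in patches] for t in tokens]


def top_k_groups(sim, k):
    """For each token (row), return the indices of its k best-matching patches.

    A fixed k is an assumption of this sketch, not a detail from the paper.
    """
    groups = []
    for row in sim:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        groups.append(ranked[:k])
    return groups


# Toy embeddings (hypothetical values): 2 tokens, 4 patches, dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.1], [0.9, 0.5], [0.1, 1.0], [-1.0, 0.0]]

sim = similarity_matrix(tokens, patches)
visual_groups = top_k_groups(sim, k=2)
print(visual_groups)  # one list of patch indices per token
```

Keeping only the top-scoring entries per row is one simple way to obtain a sparse token-patch assignment; a real system would operate on learned encoder outputs rather than hand-written vectors.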

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling