AGA: An adaptive group alignment framework for structured medical cross-modal representation learning
Wei Li, Xun Gong, Jiao Li, Xiaobin Sun

TL;DR
This paper introduces AGA, a novel framework for structured medical image-report representation learning that dynamically groups visual and linguistic features, improving alignment without relying on large negative sample sets.
Contribution
The paper proposes a new adaptive grouping mechanism and an instance-aware alignment loss to better capture structured semantics in medical cross-modal data.
Findings
Achieves superior image-text retrieval performance
Improves classification accuracy on medical datasets
Operates effectively in zero-shot and fine-tuning scenarios
Abstract
Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and…
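The token-to-patch step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `token_visual_groups`, the use of cosine similarity, the group size `k`, and mean pooling of the selected patches are all assumptions made for illustration. The reverse direction of the bidirectional grouping is cut off in the abstract above, so it is not sketched here.

```python
import numpy as np

def token_visual_groups(token_feats, patch_feats, k=3):
    """Hypothetical sketch: each text token selects its top-k image patches.

    token_feats: array of shape (T, D), one row per text token.
    patch_feats: array of shape (P, D), one row per image patch.
    Returns (sparse_sim, topk_idx, group_feats).
    """
    # Assumption: cosine similarity as the fine-grained similarity measure.
    t = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sim = t @ p.T  # (T, P) token-patch similarity matrix

    # Indices of the k most similar patches for every token, best first.
    topk_idx = np.argsort(-sim, axis=1)[:, :k]  # (T, k)

    # Sparse similarity matrix: keep only the selected entries, zero the rest.
    rows = np.arange(sim.shape[0])[:, None]
    sparse_sim = np.zeros_like(sim)
    sparse_sim[rows, topk_idx] = sim[rows, topk_idx]

    # Assumption: a visual group is summarised by the mean of its patches.
    group_feats = patch_feats[topk_idx].mean(axis=1)  # (T, D)
    return sparse_sim, topk_idx, group_feats
```

Calling `token_visual_groups(tokens, patches, k=3)` on a `(T, D)` token array and a `(P, D)` patch array returns one visual group per token. Because each pair is grouped using only its own token-patch similarities, this step needs no negative samples, which fits the motivation stated in the abstract.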
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling
