Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models

Chenyu Lian; Hong-Yu Zhou; Dongyun Liang; Jing Qin; Liansheng Wang

arXiv:2506.08990 · cs.CV · June 11, 2025

TL;DR

This paper introduces ALTA, an efficient method for medical vision-language alignment that adapts pretrained masked vision models with minimal training, significantly improving retrieval and classification performance.

Contribution

ALTA leverages pretrained masked vision models for efficient vision-language alignment with minimal parameters and computation, outperforming existing methods in medical imaging tasks.

Findings

01. ALTA improves text-to-image retrieval accuracy by over 4%.

02. ALTA improves image-to-text retrieval accuracy by about 6%.

03. ALTA uses only about 8% of the trainable parameters, and less than 1/5 of the computation, required for masked record modeling.
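
To make the two retrieval directions in the findings concrete, here is a minimal sketch of how top-1 retrieval accuracy could be computed from an image-text similarity matrix. The function name `top1_retrieval_accuracy` and the toy scores are illustrative assumptions, not the paper's evaluation code, and the paper's exact metric may differ (for example, it may use a different cutoff than top-1).

```python
def top1_retrieval_accuracy(similarity):
    """Top-1 retrieval accuracy in both directions.

    similarity[i][j] is the score between image i and text j.
    Assumption: image i and text i form the ground-truth pair.
    """
    n = len(similarity)

    # Image-to-text: for each image (row), check whether the
    # highest-scoring text is its paired text.
    i2t_hits = sum(
        1 for i in range(n)
        if max(range(n), key=lambda j: similarity[i][j]) == i
    )

    # Text-to-image: for each text (column), check whether the
    # highest-scoring image is its paired image.
    t2i_hits = sum(
        1 for j in range(n)
        if max(range(n), key=lambda i: similarity[i][j]) == j
    )

    return i2t_hits / n, t2i_hits / n


# Hypothetical similarity scores for three image-text pairs.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
i2t, t2i = top1_retrieval_accuracy(toy_similarity)
```

The two directions can disagree on the same matrix, which is why the findings report them separately.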

Abstract

Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and…
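
The cross-modal contrastive objective that the abstract refers to as CLIP-based can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This is a generic illustration in plain Python, not ALTA's training code: the function names and the temperature value of 0.07 are assumptions, and the embeddings are assumed to be non-zero vectors produced by separate image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss for a batch where image i is paired with text i.

    The temperature default is an assumed value, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]

    # Cosine similarities scaled by the temperature: rows are images,
    # columns are texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transpose so that rows are texts and columns are images.
    logits_t = [list(col) for col in zip(*logits)]

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

The loss is small when each image is most similar to its own report and large when pairs are mismatched, which is the property that retrieval and zero-shot classification rely on.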

Code & Models

Repositories

dopaminelcy/alta (PyTorch, official)

Taxonomy

Methods: Contrastive Learning