TL;DR
This paper introduces ALTA, an efficient method for medical vision-language alignment that adapts pretrained masked vision models with minimal training, significantly improving retrieval and classification performance.
Contribution
ALTA leverages pretrained masked vision models for efficient vision-language alignment with minimal parameters and computation, outperforming existing methods in medical imaging tasks.
Findings
ALTA improves text-to-image accuracy by more than 4%.
ALTA improves image-to-text retrieval accuracy by about 6%.
ALTA uses only about 8% of the trainable parameters, and less than 1/5 of the computation, required for masked record modeling.
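To make the two retrieval directions in these findings concrete, here is a minimal sketch of how top-1 retrieval accuracy is commonly computed from a text-image similarity matrix. The function names and the toy scores are illustrative assumptions. They are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def top1_retrieval_accuracy(similarity):
    """Fraction of queries whose highest-scoring candidate is the paired one.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap the roles of queries and candidates."""
    return [list(col) for col in zip(*matrix)]


# Toy scores (hypothetical): rows are text queries, columns are images.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
    [0.7, 0.2, 0.6],
]

text_to_image = top1_retrieval_accuracy(sim)             # texts retrieve images
image_to_text = top1_retrieval_accuracy(transpose(sim))  # images retrieve texts
```

The same matrix serves both directions: reading it row by row scores text-to-image retrieval, and transposing it scores image-to-text retrieval, which is why the two accuracies can differ.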
Abstract
Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and…
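For readers unfamiliar with the CLIP-based baseline the abstract refers to, the following is a minimal sketch of a symmetric cross-modal contrastive loss in pure Python. It illustrates the general technique only and is not ALTA's training objective. The function names, the toy embeddings, and the temperature of 0.07 are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    The temperature value is an assumption, not a setting from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (hypothetical): correctly paired vs. swapped pairs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is most similar to its own text and grows when pairs are mismatched, which is the property that makes this objective suited to retrieval and zero-shot classification.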
Taxonomy
Methods: Contrastive Learning
