Modest-Align: Data-Efficient Alignment for Vision-Language Models
Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, Mingkun Xu, Zuozhu Liu

TL;DR
Modest-Align is a lightweight, robust framework for vision-language model alignment that improves performance in low-resource settings by reducing overconfidence through noise simulation and embedding calibration.
Contribution
We introduce Modest-Align, a novel alignment method that enhances robustness and efficiency in low-data regimes using two complementary strategies.
Findings
Outperforms state-of-the-art in retrieval tasks
Achieves similar results with 100x less data
Requires 600x less GPU time than CLIP
Abstract
Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies -- Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
