Modest-Align: Data-Efficient Alignment for Vision-Language Models

Jiaxiang Liu; Yuan Wang; Jiawei Du; Joey Tianyi Zhou; Mingkun Xu; Zuozhu Liu

arXiv:2510.21606 · cs.CV · October 27, 2025

Open Access

TL;DR

Modest-Align is a lightweight, robust framework for vision-language model alignment that improves performance in low-resource settings by reducing overconfidence through noise simulation and embedding calibration.

Contribution

We introduce Modest-Align, an alignment method that improves robustness and efficiency in low-data regimes through two complementary strategies: Random Perturbation and Embedding Smoothing.

Findings

01 Outperforms state-of-the-art methods on retrieval tasks

02 Achieves similar results with 100x less data

03 Requires 600x less GPU time than CLIP

Abstract

Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies -- Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing,…
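The abstract is cut off before it describes either strategy in detail, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that "random perturbation" means adding Gaussian noise to normalized embeddings, and it substitutes plain label smoothing on the contrastive targets for the paper's Embedding Smoothing, whose actual form is not given here. All function names and the parameters `sigma`, `eps`, and `temperature` are hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def perturb(v, sigma, rng):
    """ASSUMPTION: perturbation = isotropic Gaussian noise, then re-normalization."""
    if sigma == 0.0:
        return list(v)
    return l2_normalize([x + rng.gauss(0.0, sigma) for x in v])


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def smoothed_targets(n, i, eps):
    """ASSUMPTION: label smoothing stands in for the paper's Embedding Smoothing.

    Puts 1 - eps on the matched pair and spreads eps over the other n - 1 items.
    """
    if n == 1:
        return [1.0]
    off = eps / (n - 1)
    return [1.0 - eps if j == i else off for j in range(n)]


def soft_contrastive_loss(img, txt, temperature=0.07, sigma=0.0, eps=0.0, seed=0):
    """Symmetric contrastive loss with noisy embeddings and soft targets.

    img, txt: equal-length lists of embedding vectors; img[i] pairs with txt[i].
    With sigma=0 and eps=0 this reduces to the usual one-hot contrastive loss.
    """
    rng = random.Random(seed)
    img = [perturb(l2_normalize(v), sigma, rng) for v in img]
    txt = [perturb(l2_normalize(v), sigma, rng) for v in txt]
    n = len(img)
    # Temperature-scaled cosine similarities between every image and every text.
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    loss = 0.0
    for i in range(n):
        target = smoothed_targets(n, i, eps)
        img_to_txt = log_softmax(logits[i])
        txt_to_img = log_softmax([logits[j][i] for j in range(n)])
        loss -= sum(t * lp for t, lp in zip(target, img_to_txt))
        loss -= sum(t * lp for t, lp in zip(target, txt_to_img))
    return loss / (2 * n)
```

Calling `soft_contrastive_loss(img, txt, sigma=0.1, eps=0.1)` on a small batch gives a loss that no longer rewards pushing the matched pair's probability all the way to 1, which is one plausible way to discourage overconfidence on ambiguous pairs. Whether the paper applies noise and smoothing at these points in the pipeline is not stated in the text above.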

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning