TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks

Yuanze Hu; Zhaoxin Fan; Xinyu Wang; Gen Li; Ye Qiu; Zhichao Yang; Wenjun Wu; Kejian Wu; Yifan Sun; Xiaotie Deng; Jin Dong

arXiv:2505.12884·cs.LG·July 1, 2025

Open Access

TL;DR

TinyAlign enhances lightweight vision-language models by retrieving relevant context to improve modal alignment, leading to better performance, faster training, and higher data efficiency in resource-limited settings.

Contribution

We introduce TinyAlign, a retrieval-based framework that mitigates modal alignment bottlenecks in lightweight VLMs, supported by a mutual information perspective.

Findings

01 Reduces training loss and accelerates convergence

02 Matches baseline performance using only 40% of the fine-tuning data

03 Provides a theoretical understanding of alignment bottlenecks

Abstract

Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. However, this strategy heavily depends on the intrinsic capabilities of the language model, which can be suboptimal for lightweight models with limited representational capacity. In this work, we investigate this alignment bottleneck through the lens of mutual information, demonstrating that the constrained capacity of the language model inherently limits the Effective Mutual Information (EMI) between multimodal inputs and outputs, thereby compromising alignment quality. To address this challenge, we propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant…
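
The abstract breaks off before describing what TinyAlign retrieves or how, so the following is only a generic illustration of the retrieval step common to Retrieval-Augmented Generation, not the authors' method. It is a minimal sketch assuming a small memory bank of key vectors and cosine-similarity lookup; the names `cosine`, `retrieve_top_k`, `memory_keys`, and `query` are hypothetical and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, memory_keys, k=2):
    """Return (index, similarity) pairs for the k keys most similar to query.

    Hypothetical helper: a real system would store a payload alongside each
    key and would likely use an approximate nearest-neighbor index.
    """
    scored = [(cosine(query, key), i) for i, key in enumerate(memory_keys)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy memory bank of 3-dimensional keys (made-up values for illustration).
memory_keys = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.9, 0.1, 0.0]

hits = retrieve_top_k(query, memory_keys, k=2)
for index, score in hits:
    print(index, round(score, 3))
```

In a retrieval-augmented setup, the entries selected this way would be supplied to the model as additional context alongside the original input; how TinyAlign does this is not stated in the truncated abstract.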

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications