VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

Byung-Kwan Lee; Ryo Hachiuma; Yu-Chiang Frank Wang; Yong Man Ro; Yueh-Hua Wu

arXiv:2412.01822·cs.CV·October 23, 2025

Open Access

TL;DR

VLsI introduces a layer-wise distillation approach with verbalizers to efficiently scale small vision-language models, achieving significant benchmark improvements without increasing model size or complexity.

Contribution

The paper presents VLsI, a layer-wise distillation method that uses verbalizers to improve small vision-language models, allowing them to outperform larger models without increasing model size.

Findings

01. Achieves benchmark improvements of 11.0% for the 2B model and 17.4% for the 7B model.

02. Validates effectiveness across ten vision-language benchmarks.

03. Reduces computational costs while maintaining high accuracy.

Abstract

The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in…
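To make the idea of a verbalizer concrete, the following is a minimal sketch and not the authors' implementation. It assumes a verbalizer is a single linear projection from a layer's hidden vector into vocabulary space, that teacher and student layers are paired by a fixed, hand-chosen list, and that the alignment loss is a KL divergence between the two verbalized distributions. All names here (`verbalize`, `layerwise_distill_loss`, `layer_pairs`) are hypothetical and chosen for illustration only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verbalize(hidden, weight):
    """Map one hidden vector to a distribution over the vocabulary.

    Assumption: the verbalizer is a plain linear map; `weight` has one
    row per vocabulary entry and one column per hidden dimension.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    return softmax(logits)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def layerwise_distill_loss(teacher_hiddens, student_hiddens,
                           teacher_verb, student_verb, layer_pairs):
    """Average KL between verbalized teacher and student layers.

    Assumption: `layer_pairs` is a fixed list of (teacher_layer,
    student_layer) indices picked by hand for this sketch.
    """
    losses = []
    for t_idx, s_idx in layer_pairs:
        p = verbalize(teacher_hiddens[t_idx], teacher_verb)
        q = verbalize(student_hiddens[s_idx], student_verb)
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)


# Toy setup: a 4-layer teacher (hidden size 8), a 2-layer student
# (hidden size 4), and a shared vocabulary of 6 tokens.
random.seed(0)
VOCAB, T_DIM, S_DIM = 6, 8, 4
teacher_hiddens = [[random.gauss(0, 1) for _ in range(T_DIM)] for _ in range(4)]
student_hiddens = [[random.gauss(0, 1) for _ in range(S_DIM)] for _ in range(2)]
teacher_verb = [[random.gauss(0, 1) for _ in range(T_DIM)] for _ in range(VOCAB)]
student_verb = [[random.gauss(0, 1) for _ in range(S_DIM)] for _ in range(VOCAB)]

loss = layerwise_distill_loss(teacher_hiddens, student_hiddens,
                              teacher_verb, student_verb,
                              layer_pairs=[(1, 0), (3, 1)])
print(f"layer-wise distillation loss: {loss:.4f}")
```

Projecting both models into the same vocabulary space is what lets a teacher and a student with different hidden sizes be compared directly. In a real training setup the loss would be minimized with respect to the student's parameters, which this sketch does not attempt.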

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: ALIGN