Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models
Yuchun Fan, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li, Xiaocheng Feng, Tong Xiao, and Jingbo Zhu

TL;DR
This paper introduces PLAST, a method for enhancing multilingual capabilities of large vision-language models by fine-tuning language-specific shallow layers, achieving significant improvements with minimal parameter updates.
Contribution
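The paper presents an approach that identifies language-specific layers in LVLMs by monitoring language-specific neuron activations, then fine-tunes only those layers with question-translation pairs to improve multilingual understanding efficiently.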
Findings
PLAST improves multilingual performance on MM-Bench and MMMB.
Only 14% of parameters need tuning for effective enhancement.
PLAST generalizes well to low-resource languages and to complex visual reasoning tasks.
Abstract
Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively…
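The two steps described in the abstract (find the layers where language-specific neurons concentrate, then tune only those layers) can be illustrated with a small sketch. This is not the authors' implementation: the activation-rate input format, the `high`/`low` thresholds, the top-`k` selection rule, and all function names are assumptions made for illustration, since the summary above does not give the paper's exact criterion. The toy numbers are constructed so that the early layers win.

```python
from typing import Dict, List

# Hypothetical input: rates[lang][layer][neuron] is the fraction of tokens in
# that language for which the neuron fires (values in [0, 1]).
ActivationRates = Dict[str, List[List[float]]]


def count_language_specific_neurons(
    rates: ActivationRates, high: float = 0.8, low: float = 0.2
) -> List[int]:
    """Count, per layer, neurons that fire often for exactly one language.

    The high/low thresholds are an illustrative criterion, not the paper's.
    """
    langs = list(rates)
    num_layers = len(rates[langs[0]])
    counts = []
    for layer in range(num_layers):
        num_neurons = len(rates[langs[0]][layer])
        specific = 0
        for n in range(num_neurons):
            values = [rates[lang][layer][n] for lang in langs]
            active = sum(v >= high for v in values)
            quiet = sum(v <= low for v in values)
            # Specific: strongly active for one language, quiet for all others.
            if active == 1 and quiet == len(langs) - 1:
                specific += 1
        counts.append(specific)
    return counts


def select_layers(counts: List[int], k: int) -> List[int]:
    """Return indices of the k layers with the most language-specific neurons."""
    ranked = sorted(range(len(counts)), key=lambda i: (-counts[i], i))
    return sorted(ranked[:k])


def trainable_fraction(layer_params: List[int], selected: List[int]) -> float:
    """Fraction of parameters updated if only the selected layers are tuned."""
    return sum(layer_params[i] for i in selected) / sum(layer_params)


# Toy example: 2 languages, 4 layers, 3 neurons per layer (made-up numbers).
rates = {
    "en": [[0.9, 0.1, 0.5], [0.9, 0.1, 0.9], [0.5, 0.5, 0.5], [0.4, 0.6, 0.5]],
    "zh": [[0.1, 0.9, 0.5], [0.1, 0.9, 0.1], [0.5, 0.5, 0.5], [0.6, 0.4, 0.5]],
}
counts = count_language_specific_neurons(rates)
selected = select_layers(counts, k=2)
fraction = trainable_fraction([100, 100, 100, 100], selected)
```

In a real training setup, the selected indices would decide which layers keep gradients enabled while all other parameters stay frozen. The trainable fraction is then the quantity the findings report as 14% for PLAST.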
