Parrot: Multilingual Visual Instruction Tuning
Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

TL;DR
PARROT introduces a multilingual visual instruction tuning method that improves alignment of visual tokens with diverse languages, enhancing performance across multiple languages and multimodal tasks.
Contribution
It proposes a language-guided approach to visual token alignment that combines textual guidance with a Mixture-of-Experts (MoE) module, targeting the multilingual token alignment failures seen in multimodal models.
Findings
Achieves state-of-the-art results on multilingual benchmarks
Effectively aligns visual tokens with multiple languages
Demonstrates improved performance on diverse multimodal tasks
Abstract
The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts,…
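The abstract describes computing cross-attention between initial visual features and textual embeddings, then using the result to select experts. The NumPy sketch below illustrates that idea only. It is not the authors' implementation: the function name `language_guided_moe`, the mean pooling, the linear router, the top-k selection, and the residual combination of expert outputs are all assumptions made for illustration, since the abstract excerpt is truncated before it says how the selected experts are applied.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def language_guided_moe(visual, text, router_w, expert_ws, top_k=1):
    """Hypothetical sketch of language-guided expert selection.

    visual:    (Nv, d) initial visual token features
    text:      (Nt, d) textual embeddings of the prompt
    router_w:  (d, E)  assumed linear router weights
    expert_ws: (E, d, d) assumed per-expert linear maps
    """
    d = visual.shape[-1]

    # Cross-attention: visual tokens are queries, text embeddings are
    # keys and values (scaled dot-product attention).
    attn = softmax(visual @ text.T / np.sqrt(d), axis=-1)  # (Nv, Nt)
    guided = attn @ text                                    # (Nv, d)

    # Assumption: pool the text-conditioned features into one vector
    # and score the experts with a linear router.
    pooled = guided.mean(axis=0)                            # (d,)
    gate = softmax(pooled @ router_w)                       # (E,)

    # Keep only the top_k experts and renormalize their weights.
    top = np.argsort(gate)[-top_k:]
    weights = np.zeros_like(gate)
    weights[top] = gate[top] / gate[top].sum()

    # Assumption: add the weighted expert outputs back onto the
    # original visual tokens as a residual.
    out = visual.copy()
    for e in top:
        out = out + weights[e] * (visual @ expert_ws[e])
    return out, weights
```

With, say, 4 visual tokens, 3 text tokens, and 5 experts, calling `language_guided_moe(visual, text, router_w, expert_ws, top_k=2)` returns visual tokens of the original shape plus a weight vector with exactly two non-zero entries that sum to 1. Different prompt languages would produce different text embeddings and therefore different expert weights, which is the behavior the abstract attributes to the method.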
Taxonomy
Topics: EFL/ESL Teaching and Learning
Methods: Attention Is All You Need · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Softmax · Focus · Mixture of Experts · Linear Layer · Parrot · Shrink and Fine-Tune
