Parrot: Multilingual Visual Instruction Tuning

Hai-Long Sun; Da-Wei Zhou; Yang Li; Shiyin Lu; Chao Yi; Qing-Guo Chen; Zhao Xu; Weihua Luo; Kaifu Zhang; De-Chuan Zhan; Han-Jia Ye

arXiv:2406.02539·cs.CV·May 27, 2025·1 cites

TL;DR

PARROT is a multilingual visual instruction tuning method that improves the alignment of visual tokens with diverse languages, raising performance across multiple languages and multimodal tasks.

Contribution

The paper proposes a language-guided visual token alignment approach that combines textual guidance with a Mixture-of-Experts (MoE) module to address multilingual token misalignment in multimodal models.

Findings

01. Achieves state-of-the-art results on multilingual benchmarks
02. Effectively aligns visual tokens with multiple languages
03. Demonstrates improved performance on diverse multimodal tasks

Abstract

The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts,…
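
The expert-selection step described in the abstract can be illustrated with a small sketch. The abstract is truncated here and does not give the exact architecture, so everything below is an assumption made for illustration: the function name `language_guided_moe`, the tensor shapes, mean pooling before routing, soft (softmax) gating over all experts, and experts modeled as plain linear maps. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def language_guided_moe(visual, text, w_router, experts):
    """Hypothetical sketch of language-guided expert routing.

    visual:   (n_vis, d)  initial visual features
    text:     (n_txt, d)  textual embeddings of the prompt
    w_router: (d, n_experts) router weights (assumed linear router)
    experts:  (n_experts, d, d) one linear map per expert (assumption)
    """
    d = visual.shape[-1]
    # Cross-attention: visual features are queries, text embeddings are keys/values.
    attn = softmax(visual @ text.T / np.sqrt(d), axis=-1)    # (n_vis, n_txt)
    guided = attn @ text                                      # (n_vis, d)
    # Pool to one language-aware vector and turn it into expert weights.
    gate = softmax(guided.mean(axis=0) @ w_router)            # (n_experts,)
    # Apply every expert to the visual tokens, then mix by the gate weights.
    expert_out = np.einsum("nd,edk->enk", visual, experts)    # (n_experts, n_vis, d)
    mixed = np.tensordot(gate, expert_out, axes=1)            # (n_vis, d)
    return mixed, gate

# Toy usage with random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))
text = rng.normal(size=(3, 8))
w_router = rng.normal(size=(8, 2))
experts = rng.normal(size=(2, 8, 8))
mixed, gate = language_guided_moe(visual, text, w_router, experts)
```

Because the gate depends on the text embeddings, prompts in different languages can weight the experts differently while the visual input stays the same, which is the intuition behind aligning visual tokens at the language level.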

Code & Models

Datasets

AIDC-AI/Parrot-dataset
dataset · 232 downloads

Taxonomy

Topics: EFL/ESL Teaching and Learning

Methods: Attention Is All You Need · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Softmax · Focus · Mixture of Experts · Linear Layer · Parrot · Shrink and Fine-Tune