Generalizing Multimodal Pre-training into Multilingual via Language Acquisition
Liang Zhang, Anwen Hu, Qin Jin

TL;DR
This paper introduces a lightweight multilingual acquisition framework that efficiently extends monolingual vision-language models to multiple languages, achieving state-of-the-art results with less data and computational resources.
Contribution
The proposed MultiLingual Acquisition (MLA) framework enables flexible multilingual extension of monolingual VLP models using a lightweight encoder and a two-stage training strategy.
Findings
Achieves state-of-the-art performance on multilingual retrieval benchmarks.
Requires less multilingual data and fewer computing resources than Multilingual Vision-Language Pre-training (M-VLP) models.
Effectively generalizes monolingual models to multiple languages.
Abstract
English-based Vision-Language Pre-training (VLP) has achieved great success in various downstream tasks. Some efforts have been made to generalize this success to non-English languages through Multilingual Vision-Language Pre-training (M-VLP). However, due to the large number of languages, M-VLP models often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models. We further propose a two-stage training strategy to optimize the language acquisition encoder, namely the Native Language Transfer stage and the Language Exposure stage. With much less multilingual…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
