Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

Liang Zhang; Anwen Hu; Qin Jin

arXiv:2206.11091 · cs.CL · June 23, 2022 · 5 cites

Open Access

TL;DR

This paper introduces a lightweight multilingual acquisition framework that efficiently extends monolingual vision-language models to multiple languages, achieving state-of-the-art results with less data and computational resources.

Contribution

The proposed MLA framework enables flexible multilingual extension of monolingual VLP models using a lightweight encoder and a two-stage training strategy.

Findings

01

Achieves state-of-the-art performance on multilingual retrieval benchmarks.

02

Requires much less multilingual data and computing resources than Multilingual Vision-Language Pre-training (M-VLP) models.

03

Effectively generalizes monolingual models to multiple languages.

Abstract

English-based Vision-Language Pre-training (VLP) has achieved great success in various downstream tasks. Some efforts have been made to generalize this success to non-English languages through Multilingual Vision-Language Pre-training (M-VLP). However, because of the large number of languages involved, M-VLP models often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models. We further propose a two-stage training strategy to optimize the language acquisition encoder, namely the Native Language Transfer stage and the Language Exposure stage. With much less multilingual…
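
The abstract names a lightweight language acquisition encoder and two training stages but does not give their details, so the following is only a toy sketch of the general idea and not the authors' implementation. It assumes that the pretrained encoders stay frozen, that the Native Language Transfer stage pulls non-English text embeddings toward the frozen English text embeddings of translation pairs, and that the Language Exposure stage then pulls them toward the frozen image embeddings. All names (`frozen_text_encoder`, `frozen_image_encoder`, `AcquisitionEncoder`), the linear stand-in encoders, and the squared-error loss used in both stages are hypothetical choices made for illustration.

```python
import random

DIM = 4

def frozen_text_encoder(english_feats):
    # Stand-in for a pretrained English text tower; it is never updated.
    return [2.0 * v for v in english_feats]

def frozen_image_encoder(scene):
    # Stand-in for a pretrained image tower; it is never updated.
    return [2.0 * scene[i] + 0.1 * scene[(i + 1) % DIM] for i in range(DIM)]

class AcquisitionEncoder:
    """Hypothetical lightweight module: the only trainable weights."""

    def __init__(self, dim):
        self.w = [[0.0] * dim for _ in range(dim)]

    def forward(self, x):
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in self.w]

    def sgd_step(self, x, target, lr):
        # Gradient of ||Wx - target||^2 with respect to W is 2 (Wx - target) x^T.
        err = [yi - ti for yi, ti in zip(self.forward(x), target)]
        for i, row in enumerate(self.w):
            for j in range(len(row)):
                row[j] -= lr * 2.0 * err[i] * x[j]

def make_example(rng):
    scene = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    english = scene        # toy English caption features
    foreign = scene[::-1]  # toy non-English caption features (reordered)
    return scene, english, foreign

def mean_loss(encoder, pairs):
    total = 0.0
    for x, target in pairs:
        total += sum((y - t) ** 2 for y, t in zip(encoder.forward(x), target))
    return total / len(pairs)

def train(encoder, pairs, lr=0.1):
    for x, target in pairs:
        encoder.sgd_step(x, target, lr)

rng = random.Random(0)
data = [make_example(rng) for _ in range(500)]

# Stage 1 (assumed form of Native Language Transfer): match the frozen
# English text embedding of each translation pair.
stage1_pairs = [(f, frozen_text_encoder(e)) for _, e, f in data]
# Stage 2 (assumed form of Language Exposure): match the frozen image embedding.
stage2_pairs = [(f, frozen_image_encoder(s)) for s, _, f in data]

encoder = AcquisitionEncoder(DIM)
train(encoder, stage1_pairs)
stage1_loss = mean_loss(encoder, stage1_pairs[-50:])
stage2_before = mean_loss(encoder, stage2_pairs[-50:])
train(encoder, stage2_pairs)
stage2_after = mean_loss(encoder, stage2_pairs[-50:])
```

In this sketch only the small `AcquisitionEncoder` weight matrix is updated, which is the property that would make adding a new language cheap compared with retraining a full multilingual model. A contrastive objective could replace the squared error in either stage; squared error is used here only to keep the example short.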

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques