Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models
Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, and Alexander Hauptmann

TL;DR
This paper introduces a multilingual multimodal pre-training approach with a new dataset to enhance zero-shot cross-lingual video and image search capabilities of vision-language models, significantly improving performance in non-English languages.
Contribution
It proposes a novel multilingual multimodal pre-training strategy and a new dataset, MultiHowTo100M, to improve zero-shot cross-lingual vision-language tasks.
Findings
Significant performance degradation occurs in non-English queries without pre-training.
Pre-training with MultiHowTo100M improves non-English video search results.
Our method outperforms recent baselines in multilingual text-to-video and text-to-image search.
Abstract
This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (MultiHowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX; as…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
