cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

Kshitij Gupta; Devansh Gautam; Radhika Mamidi

arXiv:2206.03354 · cs.CL · June 10, 2022

TL;DR

This paper introduces cViL, a method for training vision-language models in new languages using knowledge distillation from English models, achieving improved accuracy in Japanese and Hindi visual question answering tasks.

Contribution

The paper presents a novel cross-lingual training pipeline that leverages English models and knowledge distillation to efficiently extend vision-language models to other languages.

Findings

01. Outperforms state-of-the-art by 4.4% in Japanese VQA accuracy.

02. Outperforms state-of-the-art by 13.4% in Hindi VQA accuracy.

03. Provides a large-scale VQA dataset in Japanese and Hindi.

Abstract

Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that utilizes English-only vision-language models to train a monolingual model for a target language. We propose to extend OSCAR+, a model which leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We propose a novel approach to knowledge distillation to train the model in other languages using parallel sentences. Compared to other models that use the target language in the pretraining corpora, we can leverage an existing English model to transfer the knowledge to the target language using significantly fewer resources. We also release a large-scale visual question answering dataset in Japanese and Hindi. Though we restrict our work to visual…
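
To make the idea of distilling over parallel sentences concrete, here is a minimal sketch. It assumes a generic formulation in which a frozen English teacher encodes the English sentence, a trainable student encodes its translation, and the student is penalized by the mean squared error between the two representations. The abstract does not specify the paper's actual objective, so the loss choice, the function name `mse_distillation_loss`, and the toy vectors are all illustrative assumptions rather than the authors' method.

```python
from typing import List

Vector = List[float]


def mse_distillation_loss(teacher: List[Vector], student: List[Vector]) -> float:
    """Mean squared error between teacher and student representations.

    Each row is the representation of one sentence in a parallel pair:
    teacher[i] comes from the English sentence, student[i] from its
    translation in the target language. (Hypothetical setup; the paper's
    exact distillation objective may differ.)
    """
    if not teacher or len(teacher) != len(student):
        raise ValueError("teacher and student must be non-empty batches of equal size")

    total = 0.0
    count = 0
    for t_vec, s_vec in zip(teacher, student):
        if len(t_vec) != len(s_vec):
            raise ValueError("representations must share the same dimensionality")
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    if count == 0:
        raise ValueError("representations must have at least one dimension")
    return total / count


# Toy batch of two parallel sentence pairs with 2-dimensional representations.
teacher_out = [[1.0, 0.0], [0.0, 1.0]]  # frozen English teacher outputs
student_out = [[0.5, 0.0], [0.0, 0.5]]  # target-language student outputs

loss = mse_distillation_loss(teacher_out, student_out)
print(f"distillation loss: {loss:.3f}")
```

In a real training loop this scalar would be minimized with respect to the student's parameters only, so the student learns to place a target-language sentence where the teacher places its English counterpart.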

Code & Models

Repositories

kshitij98/cvil (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Knowledge Distillation