cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
Kshitij Gupta, Devansh Gautam, Radhika Mamidi

TL;DR
This paper introduces cViL, a method for training vision-language models in new languages using knowledge distillation from English models, achieving improved accuracy in Japanese and Hindi visual question answering tasks.
Contribution
The paper presents a novel cross-lingual training pipeline that leverages English models and knowledge distillation to efficiently extend vision-language models to other languages.
Findings
Outperforms state-of-the-art by 4.4% in Japanese VQA accuracy.
Outperforms state-of-the-art by 13.4% in Hindi VQA accuracy.
Provides a large-scale VQA dataset in Japanese and Hindi.
Abstract
Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that utilizes English-only vision-language models to train a monolingual model for a target language. We propose to extend OSCAR+, a model which leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We propose a novel approach to knowledge distillation to train the model in other languages using parallel sentences. Compared to other models that use the target language in the pretraining corpora, we can leverage an existing English model to transfer the knowledge to the target language using significantly fewer resources. We also release a large-scale visual question answering dataset in Japanese and Hindi. Though we restrict our work to visual…
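The abstract does not spell out the distillation objective, so the following is only a minimal sketch of generic soft-label knowledge distillation applied to a parallel sentence pair, not the paper's own formulation. It assumes an English teacher and a target-language student that each produce answer logits for the same image, with the teacher reading the English question and the student reading its translation. The function names, the temperature value, and the logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    This is the standard soft-label distillation loss, used here as an
    assumed stand-in for the objective the abstract mentions.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Hypothetical answer logits for one parallel question pair on the same image.
teacher_logits = [2.0, 0.5, -1.0]  # English teacher, English question
student_logits = [1.5, 0.7, -0.5]  # student, Japanese or Hindi translation

loss = distillation_loss(teacher_logits, student_logits)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is what lets parallel sentences stand in for target-language labels in this kind of setup.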
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Knowledge Distillation
