Curriculum Script Distillation for Multilingual Visual Question Answering
Khyathi Raghavi Chandu, Alborz Geramifard

TL;DR
This paper introduces a curriculum script distillation method that fine-tunes pre-trained vision-and-language models for multilingual visual question answering (VQA), and shows that the script shared between source and target languages strongly affects performance.
Contribution
It proposes a novel curriculum-based fine-tuning approach that utilizes language script information to boost multilingual VQA performance.
Findings
Target languages that share the same script perform ~6% better than other languages.
Mixed-script code-switched languages perform ~5-12% better than their counterparts.
Script plays a vital role in multilingual VQA effectiveness.
Abstract
Pre-trained models with dual and cross encoders have shown remarkable success in propelling the landscape of several tasks in vision and language in Visual Question Answering (VQA). However, since they are limited by the requirements of gold annotated data, most of these advancements do not see the light of day in other languages beyond English. We aim to address this problem by introducing a curriculum based on the source and target language translations to finetune the pre-trained models for the downstream task. Experimental results demonstrate that script plays a vital role in the performance of these models. Specifically, we show that target languages that share the same script perform better (~6%) than other languages and mixed-script code-switched languages perform better than their counterparts (~5-12%).
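To make the idea of grouping languages by script concrete, the sketch below detects the dominant script of a question from Unicode character names and orders translated questions so that those written in the source script come first. This is only an illustration under stated assumptions: the helper names `dominant_script` and `curriculum_order`, the sample questions, and the "same script first" ordering are hypothetical and are not taken from the paper, whose actual curriculum may differ.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return a coarse script label (e.g. "LATIN") for the letters in text."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. "CYRILLIC SMALL LETTER DE".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def curriculum_order(samples, source_script="LATIN"):
    """Hypothetical ordering: (language, question) pairs in the source script first.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(samples, key=lambda s: dominant_script(s[1]) != source_script)


# Illustrative translated questions (not from the paper's data).
samples = [
    ("hi", "यह क्या है"),
    ("en", "What is this?"),
    ("ru", "Что это?"),
    ("es", "¿Qué es esto?"),
]
ordered = curriculum_order(samples)
print([lang for lang, _ in ordered])
```

Taking the first word of the Unicode name is a rough heuristic that is adequate for telling Latin, Cyrillic, and Devanagari apart; a production pipeline would more likely rely on a dedicated script-detection library.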
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
