Delving Deeper into Cross-lingual Visual Question Answering
Chen Liu, Jonas Pfeiffer, Anna Korhonen, Ivan Vulić, Iryna Gurevych

TL;DR
This paper investigates cross-lingual visual question answering (VQA), analyzing how modeling choices and biases affect zero-shot transfer performance across languages and question types, and proposes simple training modifications to improve results.
Contribution
The paper provides a detailed analysis of the factors influencing cross-lingual VQA performance and introduces simple training modifications that substantially reduce the zero-shot transfer gap to monolingual English performance.
Findings
Simple modifications to the standard training setup improve zero-shot transfer accuracy by 10 points.
Certain question types are more challenging to improve across languages.
Biases in training data explain persistent zero-shot performance gaps.
Abstract
Visual question answering (VQA) is one of the crucial vision-and-language tasks. Yet, existing VQA research has mostly focused on the English language, due to a lack of suitable evaluation resources. Previous work on cross-lingual VQA has reported poor zero-shot transfer performance of current multilingual multimodal Transformers with large gaps to monolingual performance, without any deeper analysis. In this work, we delve deeper into the different aspects of cross-lingual VQA, aiming to understand the impact of 1) modeling methods and choices, including architecture, inductive bias, and fine-tuning; 2) learning biases, including question types and modality biases in cross-lingual setups. The key results of our analysis are: 1) We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance, yielding +10 accuracy…
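The "transfer gap" discussed in the abstract can be read as source-language accuracy minus target-language accuracy, optionally broken down by question type. The sketch below illustrates that bookkeeping only; it is not the authors' evaluation code. The function names, the language codes, the question-type labels, and the toy records are all hypothetical and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per (language, question_type).

    `records` is an iterable of (language, question_type, is_correct)
    tuples. This record layout is an assumption made for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for lang, qtype, correct in records:
        cell = totals[(lang, qtype)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: c / n for key, (c, n) in totals.items()}


def transfer_gap(acc, source="en"):
    """Gap = source accuracy - target accuracy, per (target, question_type)."""
    gaps = {}
    for (lang, qtype), target_acc in acc.items():
        if lang == source:
            continue
        source_acc = acc.get((source, qtype))
        if source_acc is not None:
            gaps[(lang, qtype)] = source_acc - target_acc
    return gaps


# Toy predictions, invented purely to exercise the functions.
records = [
    ("en", "yes/no", True), ("en", "yes/no", True),
    ("de", "yes/no", True), ("de", "yes/no", False),
    ("en", "count", True), ("en", "count", False),
    ("de", "count", False), ("de", "count", False),
]

acc = accuracy_by_group(records)
gaps = transfer_gap(acc)
for (lang, qtype), gap in sorted(gaps.items()):
    print(f"{lang} {qtype}: gap = {gap:.2f}")
```

Grouping by question type keeps the per-type gaps visible, which matters when some question types are harder to transfer than others.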
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
