Delving Deeper into Cross-lingual Visual Question Answering

Chen Liu; Jonas Pfeiffer; Anna Korhonen; Ivan Vulić; Iryna Gurevych

arXiv:2202.07630 · cs.CL · June 12, 2023 · 1 citation

TL;DR

This paper investigates cross-lingual visual question answering (VQA), analyzing how modeling choices and biases affect zero-shot transfer performance across languages and question types, and proposes simple training modifications to improve results.

Contribution

The paper provides a detailed analysis of the factors that influence cross-lingual VQA performance and introduces simple training modifications that substantially reduce the zero-shot transfer gap to monolingual English performance.

Findings

01. Simple modifications to the standard training setup improve zero-shot transfer accuracy by 10 points.

02. Certain question types are harder to improve across languages than others.

03. Biases in the training data explain the persistent zero-shot performance gaps.

Abstract

Visual question answering (VQA) is one of the crucial vision-and-language tasks. Yet, existing VQA research has mostly focused on the English language, due to a lack of suitable evaluation resources. Previous work on cross-lingual VQA has reported poor zero-shot transfer performance of current multilingual multimodal Transformers with large gaps to monolingual performance, without any deeper analysis. In this work, we delve deeper into the different aspects of cross-lingual VQA, aiming to understand the impact of 1) modeling methods and choices, including architecture, inductive bias, and fine-tuning; 2) learning biases, including question types and modality biases in cross-lingual setups. The key results of our analysis are: 1) We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance, yielding +10 accuracy…

Code & Models

Repositories

ukplab/eacl2023-xlingvqa
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning