Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, Desmond Elliott

TL;DR
This paper introduces MaRVL, a multilingual dataset for vision-language reasoning across diverse cultures and languages, highlighting current models' limitations in cross-lingual transfer and emphasizing the need for more inclusive AI systems.
Contribution
It presents a novel protocol for creating culturally and linguistically diverse vision-language datasets and demonstrates the challenges faced by current models in multilingual reasoning tasks.
Findings
State-of-the-art models perform poorly in cross-lingual transfer.
Current models are less accurate on languages other than English.
The new dataset reveals significant gaps in multilingual reasoning.
Abstract
The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
