CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani

TL;DR
CVQA is a new multilingual and culturally-diverse VQA benchmark with images and questions from 30 countries, designed to evaluate and improve the cultural understanding of multimodal AI models.
Contribution
The paper introduces CVQA, a culturally-diverse multilingual VQA dataset with native speaker input, covering 31 languages and 30 countries, and benchmarks current models on this challenging dataset.
Findings
Current models struggle with CVQA's cultural and linguistic diversity.
CVQA reveals biases and limitations in existing multimodal models.
The dataset encourages development of culturally-aware AI models.
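A finding such as "models struggle with linguistic diversity" is usually backed by a per-language breakdown of scores. The following is a minimal sketch of such a breakdown, not the paper's evaluation code. It assumes each benchmark item carries a language tag, a gold answer, and a model prediction; the function name accuracy_by_language, the record layout, and the placeholder records are all hypothetical and are not CVQA data.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: fraction of exact-match predictions}.

    `records` is an iterable of (language, gold_answer, predicted_answer)
    tuples. This layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, gold, predicted in records:
        total[language] += 1
        if predicted == gold:
            correct[language] += 1
    # Every language in `total` has at least one record, so no division by zero.
    return {language: correct[language] / total[language] for language in total}


# Placeholder records (hypothetical, not taken from the benchmark).
records = [
    ("lang_a", "B", "B"),
    ("lang_a", "C", "A"),
    ("lang_b", "D", "D"),
]
scores = accuracy_by_language(records)
```

Reporting one score per language, rather than a single pooled number, keeps weak performance on a low-resource language from being hidden by strong performance elsewhere.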
Abstract
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of…
Taxonomy
Topics: Multimodal Machine Learning Applications · Text and Document Classification Technologies · Advanced Image and Video Retrieval Techniques
Methods: Sparse Evolutionary Training
