Benchmarking Vision Language Models for Cultural Understanding
Shravan Nayak, Kanishk Jain, Rabiul Awal, Siva Reddy, Sjoerd van Steenkiste, Lisa Anne Hendricks, Karolina Stańczak, Aishwarya Agrawal

TL;DR
This paper introduces CulturalVQA, a benchmark for evaluating vision-language models' understanding of diverse cultural concepts across regions, revealing disparities in performance and highlighting areas for improvement.
Contribution
It presents a new culturally diverse VQA benchmark and analyzes current VLMs' performance, exposing gaps in cultural understanding across regions and facets.
Findings
VLMs perform better on North American cultures.
Significant performance gaps exist for African cultures.
Performance varies across cultural facets like clothing and rituals.
Abstract
Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparity in their level of cultural understanding across regions, with strong cultural…
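The benchmark structure described in the abstract (image-question pairs, 1-5 reference answers each, grouped by country and cultural facet) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record layout, the field names, the helper `accuracy_by`, and the normalized exact-match scoring rule are hypothetical and are not taken from the paper, which may use a different schema and metric.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CulturalVQAExample:
    # Hypothetical record layout; field names are assumptions,
    # not the dataset's actual schema.
    image_path: str
    question: str
    answers: list[str]  # 1-5 reference answers per question
    country: str
    facet: str  # e.g. "clothing", "food", "drinks", "rituals", "traditions"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def is_correct(prediction: str, answers: list[str]) -> bool:
    # Simplified scoring (an assumption): a prediction counts as correct
    # if it matches any reference answer after normalization.
    return normalize(prediction) in {normalize(a) for a in answers}


def accuracy_by(examples, predictions, key):
    """Group accuracy by an example attribute such as 'country' or 'facet'."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for example, prediction in zip(examples, predictions):
        group = getattr(example, key)
        totals[group] += 1
        hits[group] += int(is_correct(prediction, example.answers))
    return {group: hits[group] / totals[group] for group in totals}


# Toy data with placeholder values; not drawn from the benchmark.
toy_examples = [
    CulturalVQAExample("img1.jpg", "What is this garment?", ["Robe", "gown"], "Country A", "clothing"),
    CulturalVQAExample("img2.jpg", "What dish is shown?", ["flatbread"], "Country B", "food"),
    CulturalVQAExample("img3.jpg", "What is being served?", ["tea"], "Country B", "drinks"),
]
toy_predictions = [" robe ", "rice", "Tea"]

by_country = accuracy_by(toy_examples, toy_predictions, "country")
by_facet = accuracy_by(toy_examples, toy_predictions, "facet")
```

Breaking accuracy down by `country` and by `facet` mirrors the kind of per-region and per-facet comparison the findings refer to. Exact match is a deliberately crude stand-in, since free-form answers usually need a more lenient scorer.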
Taxonomy
Topics: Language, Metaphor, and Cognition
Methods: Sparse Evolutionary Training
