Benchmarking Vision Language Models for Cultural Understanding

Shravan Nayak; Kanishk Jain; Rabiul Awal; Siva Reddy; Sjoerd van Steenkiste; Lisa Anne Hendricks; Karolina Stańczak; Aishwarya Agrawal

arXiv:2407.10920·cs.CV·October 15, 2024

TL;DR

This paper introduces CulturalVQA, a benchmark for evaluating vision-language models' understanding of diverse cultural concepts across regions, revealing disparities in performance and highlighting areas for improvement.

Contribution

It presents a new culturally diverse VQA benchmark and analyzes current VLMs' performance, exposing gaps in cultural understanding across regions and facets.

Findings

01 VLMs perform better on North American cultures.

02 Significant performance gaps exist for African cultures.

03 Performance varies across cultural facets such as clothing and rituals.

Abstract

Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding (recognizing objects, attributes, and actions) rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals a disparity in their level of cultural understanding across regions, with strong cultural…
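To make the benchmark's shape concrete, the following is a minimal sketch of how one image-question pair with 1-5 reference answers might be represented, and how accuracy could be broken down by country. It is not the authors' code: the `CulturalVQAItem` class, its field names, and the helper functions are hypothetical, and scoring by normalized exact match is an assumption made only for illustration (the abstract does not state the metric used).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CulturalVQAItem:
    # Hypothetical field names; the abstract only says each item pairs an
    # image with a question, 1-5 reference answers, and a country of origin.
    image_path: str
    question: str
    answers: List[str]
    country: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not matter.
    return " ".join(text.lower().split())


def is_correct(prediction: str, item: CulturalVQAItem) -> bool:
    # Assumption: a prediction counts as correct if it exactly matches any
    # reference answer after normalization.
    return normalize(prediction) in {normalize(a) for a in item.answers}


def accuracy_by_country(
    items: List[CulturalVQAItem], predictions: List[str]
) -> Dict[str, float]:
    # Aggregate per-country accuracy to expose regional disparities.
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for item, prediction in zip(items, predictions):
        totals[item.country] += 1
        hits[item.country] += int(is_correct(prediction, item))
    return {country: hits[country] / totals[country] for country in totals}
```

Grouping scores by country (or by facet, with an extra field) is what lets a per-region gap show up at all; a single pooled accuracy would hide it.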

Taxonomy

Topics: Language, Metaphor, and Cognition

Methods: Sparse Evolutionary Training