CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
Shudong Liu, Yiqiao Jin, Cheng Li, Derek F. Wong, Qingsong Wen, Lichao Sun, Haipeng Chen, Xing Xie, Jindong Wang

TL;DR
This paper introduces CultureVerse, a new benchmark, and CultureVLM, a series of fine-tuned models, to improve cultural understanding in vision-language models across more than 100 countries, addressing biases and disparities in current systems.
Contribution
The paper constructs CultureVerse, a large-scale cultural benchmark, and proposes CultureVLM, a series of models fine-tuned on this dataset to enhance multicultural understanding in VLMs.
Findings
Fine-tuning improves cultural perception across diverse regions.
Models perform better on Western concepts than African and Asian ones.
Cultural understanding can be enhanced without sacrificing general VLM performance.
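The regional disparity noted in the findings can be illustrated with a per-region accuracy breakdown. The following is a minimal sketch, not the authors' evaluation code: it assumes multiple-choice questions and a hypothetical record format of (region, predicted choice, correct choice). The function name `accuracy_by_region` and the toy records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Return accuracy per region.

    records: iterable of (region, predicted_choice, correct_choice) tuples.
    This record format is a hypothetical one chosen for this sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for region, predicted, correct in records:
        totals[region] += 1
        hits[region] += int(predicted == correct)
    return {region: hits[region] / totals[region] for region in totals}


# Toy, made-up records for illustration only; not data from the paper.
toy_records = [
    ("Region A", "B", "B"),
    ("Region A", "C", "C"),
    ("Region B", "A", "D"),
    ("Region B", "D", "D"),
]

scores = accuracy_by_region(toy_records)
# Gap between the best- and worst-served regions in this toy sample.
gap = max(scores.values()) - min(scores.values())
```

Comparing `scores` before and after fine-tuning, region by region, is one simple way to check whether the gap narrows.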
Abstract
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural,…
Taxonomy
Topics: Media, Religion, Digital Communication
Methods: Attentive Walk-Aggregating Graph Neural Network
