CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries

Shudong Liu; Yiqiao Jin; Cheng Li; Derek F. Wong; Qingsong Wen; Lichao Sun; Haipeng Chen; Xing Xie; Jindong Wang

arXiv:2501.01282·cs.AI·January 3, 2025·2 cites

TL;DR

This paper introduces CultureVerse, a large-scale multimodal cultural benchmark, and CultureVLM, a series of vision-language models fine-tuned on it, to improve cultural understanding across more than 100 countries and to address the biases and disparities of current systems.

Contribution

The paper constructs CultureVerse, a large-scale cultural benchmark, and proposes CultureVLM, a series of models fine-tuned on this dataset to enhance multicultural understanding in VLMs.

Findings

01 Fine-tuning improves cultural perception across diverse regions.

02 Models perform better on Western concepts than on African and Asian ones.

03 Cultural understanding can be enhanced without sacrificing general VLM performance.

Abstract

Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural,…
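The regional disparity described in the abstract can be made concrete with a per-region accuracy breakdown. The following is a minimal sketch, not the authors' evaluation code: the record layout, the region labels, the question-type names, and the toy values are all assumptions chosen for illustration, and no number here comes from the paper.

```python
from collections import defaultdict

# Hypothetical evaluation records: (region, question_type, is_correct).
# All values are made-up toy data, not results from the paper.
records = [
    ("Western", "type_a", True),
    ("Western", "type_b", True),
    ("African", "type_a", True),
    ("African", "type_b", False),
    ("Asian", "type_a", False),
    ("Asian", "type_b", True),
]


def accuracy_by_region(rows):
    """Return {region: fraction of correctly answered questions}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for region, _question_type, is_correct in rows:
        totals[region] += 1
        correct[region] += int(is_correct)
    return {region: correct[region] / totals[region] for region in totals}


def disparity_gap(acc):
    """Difference between the best and worst per-region accuracy."""
    return max(acc.values()) - min(acc.values())


acc = accuracy_by_region(records)
for region, value in sorted(acc.items()):
    print(f"{region}: {value:.2f}")
print(f"gap: {disparity_gap(acc):.2f}")
```

Comparing this gap before and after fine-tuning is one simple way to check whether a model's cultural understanding has become more even across regions.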

Taxonomy

Topics: Media, Religion, Digital Communication

Methods: Attentive Walk-Aggregating Graph Neural Network