Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation

Arka Mukherjee; Shreya Ghosh

arXiv:2508.16762·cs.CL·August 26, 2025

TL;DR

This paper evaluates the cultural competence of vision-language models in multimodal story generation, revealing both their potential for cultural adaptation and significant limitations across architectures and metrics.

Contribution

It introduces the first systematic framework for assessing cultural awareness in multimodal VLMs through story generation tasks.

Findings

01 Models show rich culturally-specific vocabulary usage.

02 Cultural competence varies significantly across architectures.

03 Automated metrics often contradict human assessments.

Abstract

As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on a downstream task: story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover…
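
The perturbation idea described in the abstract can be illustrated with a minimal sketch: hold the story-generation task fixed and vary only the cultural identity cue in the text prompt and the paired image. Everything named here is an assumption for illustration, not the authors' implementation: the prompt template, the identity list, the image file names, and the `StoryRequest` and `build_requests` helpers are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical prompt template; the paper's actual prompts are not given on this page.
TEMPLATE = "Write a short story about the {identity} family shown in the image."

# Hypothetical identity-to-image pairing, so the cue appears in both modalities.
IMAGE_FOR_IDENTITY = {
    "Indian": "family_indian.jpg",
    "Nigerian": "family_nigerian.jpg",
    "Brazilian": "family_brazilian.jpg",
}


@dataclass(frozen=True)
class StoryRequest:
    """One model input: a cultural identity, its image, and the text prompt."""
    identity: str
    image_path: str
    prompt: str


def build_requests(template, image_for_identity):
    """Create one request per identity, changing only the identity cue."""
    return [
        StoryRequest(identity, image_path, template.format(identity=identity))
        for identity, image_path in image_for_identity.items()
    ]


story_requests = build_requests(TEMPLATE, IMAGE_FOR_IDENTITY)
for request in story_requests:
    print(request.identity, "|", request.image_path, "|", request.prompt)
```

Because the template is shared across requests, any difference between the generated stories can be attributed to the identity cue rather than to changes in the task wording.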

Code & Models

Datasets

ArkaMukherjee/mmCultural
dataset· 13 dl
