Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation
Arka Mukherjee, Shreya Ghosh

TL;DR
This paper evaluates the cultural competence of vision-language models in multimodal story generation, revealing both their potential for cultural adaptation and significant limitations across architectures and metrics.
Contribution
The paper introduces the first systematic framework for assessing the cultural awareness of VLMs through multimodal story generation.
Findings
Models adapt to cultural identity cues with rich, culturally specific vocabulary spanning names, familial terms, and geographic markers.
Cultural competence varies significantly across architectures.
Automated metrics often contradict human assessments.
Abstract
As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on a downstream task: story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover…
