World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

Eunsu Kim; Junyeong Park; Na Min An; Junseong Kim; Hitesh Laxmichand Patel; Jiho Jin; Julia Kruk; Amit Agarwal; Srikant Panda; Fenal Ashokbhai Ilasariya; Hyunjung Shim; Alice Oh

arXiv:2511.22787 · cs.CV · December 11, 2025

TL;DR

This paper investigates how large vision-language models handle culture mixing in images, revealing significant challenges and proposing robustness strategies to improve their cultural understanding and consistency.

Contribution

The paper introduces CultureMix, a benchmark for evaluating LVLMs on culture mixing scenarios, and shows that supervised fine-tuning improves model robustness.

Findings

01

LVLMs often fail to preserve cultural identities in mixed scenes.

02

Models rely heavily on backgrounds, with accuracy dropping by 14%.

03

Supervised fine-tuning improves model consistency and reduces background sensitivity.

Abstract

In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds…
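The subtask comparison described in the abstract can be illustrated with a minimal Python sketch. It computes accuracy per subtask and the gap between the food-only and food+background settings. The record fields (`subtask`, `prediction`, `answer`), the function names, and the toy records are all assumptions made for illustration; they are not the CultureMix schema, the authors' evaluation code, or results from the paper.

```python
from collections import defaultdict

# The four subtask names follow the abstract; the record layout is hypothetical.
SUBTASKS = ("food-only", "food+food", "food+background", "food+food+background")


def accuracy_by_subtask(records):
    """Return {subtask: accuracy} for an iterable of result records.

    Each record is assumed to be a dict with the keys
    'subtask', 'prediction', and 'answer' (an illustrative format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        subtask = record["subtask"]
        total[subtask] += 1
        correct[subtask] += int(record["prediction"] == record["answer"])
    return {subtask: correct[subtask] / total[subtask] for subtask in total}


def background_drop(acc):
    """Accuracy lost when a background is added to a single food item."""
    return acc["food-only"] - acc["food+background"]


# Made-up toy records, used only to exercise the functions above.
toy_records = [
    {"subtask": "food-only", "prediction": "kimchi", "answer": "kimchi"},
    {"subtask": "food-only", "prediction": "sushi", "answer": "sushi"},
    {"subtask": "food-only", "prediction": "tacos", "answer": "tacos"},
    {"subtask": "food-only", "prediction": "pho", "answer": "ramen"},
    {"subtask": "food+background", "prediction": "kimchi", "answer": "kimchi"},
    {"subtask": "food+background", "prediction": "sushi", "answer": "sushi"},
    {"subtask": "food+background", "prediction": "paella", "answer": "tacos"},
    {"subtask": "food+background", "prediction": "pho", "answer": "ramen"},
]

acc = accuracy_by_subtask(toy_records)
drop = background_drop(acc)
print(f"food-only: {acc['food-only']:.2f}")
print(f"food+background: {acc['food+background']:.2f}")
print(f"drop: {drop:.2f}")
```

Exact string matching is the simplest possible scoring rule and is used here only to keep the sketch short; a real VQA evaluation would need whatever answer-matching protocol the benchmark specifies.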

Code & Models

Datasets

EunsuKim/CultureMix
dataset · 104 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques