ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger

TL;DR
ValueGround introduces a benchmark to evaluate how well multimodal large language models can ground culture-conditioned judgments in visual scenes, revealing challenges in cross-modal cultural understanding.
Contribution
It presents ValueGround, a new benchmark for assessing culture-conditioned visual value grounding in multimodal models, built from World Values Survey (WVS) questions.
Findings
Across six MLLMs and 13 countries, average accuracy drops from 72.8% to 65.8% when response options are shown as images instead of text.
Models achieve 92.8% accuracy on option-image alignment.
Stronger models are more robust but still prone to prediction reversals.
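The two quantities behind these findings, per-setting accuracy and prediction reversals, can be illustrated with a minimal sketch. It assumes that a "reversal" means the model picks a different option for the same country and question once the options are shown as images; the paper's exact definition may differ. The `Item` record, its field names, and the toy data are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One (country, question) evaluation item. Hypothetical schema."""
    country: str
    question_id: str
    gold: int        # index (0 or 1) of the option matching the country's tendency
    text_pred: int   # model choice when options are given as text
    image_pred: int  # model choice when options are given as an image pair


def accuracy(items, field):
    """Fraction of items where the prediction stored in `field` equals gold."""
    return sum(getattr(it, field) == it.gold for it in items) / len(items)


def reversal_rate(items):
    """Fraction of items where the text and image predictions disagree
    (an assumed definition of 'prediction reversal')."""
    return sum(it.text_pred != it.image_pred for it in items) / len(items)


# Toy data for illustration only; not results from the paper.
items = [
    Item("country_A", "Q1", gold=0, text_pred=0, image_pred=0),
    Item("country_A", "Q2", gold=1, text_pred=1, image_pred=0),
    Item("country_B", "Q1", gold=1, text_pred=1, image_pred=1),
    Item("country_B", "Q2", gold=0, text_pred=1, image_pred=1),
]

text_acc = accuracy(items, "text_pred")
image_acc = accuracy(items, "image_pred")
reversals = reversal_rate(items)
print(f"text={text_acc:.2f} image={image_acc:.2f} reversals={reversals:.2f}")
```

Note that an item can be wrong in both settings without counting as a reversal, so the reversal rate and the accuracy drop measure different things.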
Abstract
Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country's value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the…
