ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

Zhipin Wang; Christoph Leiter; Christian Frey; Mohamed Hesham Ibrahim Abdalla; Josif Grabocka; Steffen Eger

arXiv:2604.06484·cs.CL·April 16, 2026

TL;DR

ValueGround introduces a benchmark to evaluate how well multimodal large language models can ground culture-conditioned judgments in visual scenes, revealing challenges in cross-modal cultural understanding.

Contribution

The paper presents ValueGround, a new benchmark for assessing culture-conditioned visual value grounding in multimodal models, built from World Values Survey (WVS) questions.

Findings

01. Across six MLLMs and 13 countries, average accuracy drops from 72.8% to 65.8% when response options are shown as images.

02. Models achieve 92.8% accuracy on option-image alignment.

03. Stronger models are more robust but still prone to prediction reversals.
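
The findings above compare accuracy between two presentation settings and mention prediction reversals. The sketch below shows one way such numbers could be computed from per-item records. It is an illustration under stated assumptions, not the authors' evaluation code: the `Item` fields, the `accuracy` and `reversal_rate` helpers, and the reading of a "reversal" as a choice that differs between the text and image settings are all hypothetical, and the records are toy values rather than data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One hypothetical benchmark record (field names are illustrative)."""
    country: str
    question_id: str
    gold: str        # "A" or "B": option matching the country's value tendency
    text_pred: str   # model choice when the options are shown as text
    image_pred: str  # model choice when the options are shown as images


def accuracy(items, field):
    """Fraction of items whose prediction in `field` equals the gold option."""
    return sum(getattr(it, field) == it.gold for it in items) / len(items)


def reversal_rate(items):
    """Fraction of items whose choice differs between the two settings.

    Assumption: this is only one plausible reading of "prediction reversal".
    """
    return sum(it.text_pred != it.image_pred for it in items) / len(items)


# Toy records, invented purely for illustration (not data from the paper).
items = [
    Item("Country X", "Q1", "A", "A", "A"),
    Item("Country X", "Q2", "B", "B", "A"),
    Item("Country Y", "Q1", "B", "B", "B"),
    Item("Country Y", "Q2", "A", "B", "B"),
]

text_acc = accuracy(items, "text_pred")
image_acc = accuracy(items, "image_pred")
gap_points = (text_acc - image_acc) * 100  # gap in percentage points

print(f"text accuracy:  {text_acc:.1%}")
print(f"image accuracy: {image_acc:.1%}")
print(f"gap:            {gap_points:.1f} points")
print(f"reversal rate:  {reversal_rate(items):.1%}")
```

Applied to the reported averages, the same subtraction gives a gap of 72.8 - 65.8 = 7.0 percentage points.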

Abstract

Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country's value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the…
