Culture in Action: Evaluating Text-to-Image Models through Social Activities
Sina Malakouti, Boqing Gong, Adriana Kovashka

TL;DR
This paper introduces CULTIVate, a comprehensive benchmark for evaluating text-to-image models on their ability to accurately depict cross-cultural social activities, highlighting disparities and biases in current models.
Contribution
The paper presents CULTIVate, a new benchmark with metrics for assessing cultural faithfulness in T2I models across diverse social activities and regions, addressing gaps in existing evaluations.
Findings
Models perform better for Global North countries.
Systematic disparities and failure modes are identified.
Metrics correlate strongly with human judgments.
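As a rough illustration of how agreement between an automatic metric and human judgments might be checked, the sketch below computes a Spearman rank correlation in plain Python. The choice of Spearman, the function names, and the sample numbers are assumptions for illustration; the summary does not state which correlation statistic the paper reports.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: one metric score and one human rating per image.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 3, 2, 5]
rho = spearman(metric_scores, human_ratings)
```

A value of `rho` near 1 would mean the metric orders images the same way human raters do; a value near 0 would mean the orderings are unrelated.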
Abstract
Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural…
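To make the idea of descriptor-based evaluation across dimensions concrete, here is a minimal sketch that scores, per dimension, the fraction of expected descriptors found in an image. It is not the paper's metric definition: the function `dimension_coverage`, the exact-match comparison, and the example descriptors are all hypothetical, and it assumes some upstream model has already produced the list of detected descriptors.

```python
def dimension_coverage(expected, detected):
    """Per-dimension fraction of expected descriptors that were detected.

    expected: dict mapping a dimension name to the descriptors a faithful
              image should contain (hypothetical input format).
    detected: dict mapping a dimension name to the descriptors found in the
              generated image (assumed to come from an upstream model).
    """
    scores = {}
    for dim, wanted in expected.items():
        wanted_set = {d.lower() for d in wanted}
        if not wanted_set:
            continue  # nothing expected, so coverage is undefined for this dimension
        found = {d.lower() for d in detected.get(dim, [])}
        scores[dim] = len(wanted_set & found) / len(wanted_set)
    return scores


# Made-up descriptors for a single prompt, for illustration only.
expected = {
    "attire": ["kimono", "obi"],
    "objects": ["tea bowl", "whisk"],
    "background": ["tatami room"],
}
detected = {
    "attire": ["kimono"],
    "objects": ["tea bowl", "whisk", "teapot"],
}
scores = dimension_coverage(expected, detected)
```

Reporting one score per dimension rather than a single number is what makes such a scheme explainable: a low score points directly at the missing descriptors.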
Peer Reviews
Decision: ICLR 2026 Poster
1. The authors introduce new metrics to measure how well T2I models depict cultural activities. As T2I models are being deployed in many parts of the world, this is a very important problem that needs to be tackled, and reliable metrics are essential.
2. The idea behind the metrics is well motivated. I agree that existing works mostly quantify alignment and quality metrics and might miss some nuances. Going beyond these and explicitly calculating hallucination and exaggeration is a nice directio
1. **CULTIVate dataset:** The authors introduce a large-scale dataset comprising 576 prompts and 19k images, but it is not clear what the utility of the dataset is beyond reporting the AHEaD metrics on them. Many of these image-prompt pairs have no human annotations against which to check the metrics' correlation. Could the authors provide more insights on this?
2. **Human Study:** The authors have conducted a very small human study with a limited number of prompts, i.e., only 9 activities per countr
1) Interpretable diagnostics: Can list top/bottom descriptors per country/activity to see what is missing or overdone, useful for model iteration
2) No per-image human labels are needed to compute metrics, and it shows good improvements over existing metrics.
1) Benchmark spans 16 countries, but the human study covers only a smaller subset (7 countries), with no clear rationale or GN/GS/per-country annotation breakdown
2) Descriptors are produced with GPT-4o and Gemini-2.5 Flash, yet “MLLM-as-judge” baselines cover only InternVL/QwenVL, with no strong judge like Gemini 2.5 Flash/Pro or GPT-4o-vision. This makes it hard to evaluate whether different descriptors are needed and to justify that we need more than one judge model.
* This work makes contributions towards cultural benchmarking of text-to-image models, which is a relatively underexplored and important area of research
* They provide a new benchmark that is focused on cultural activities in contrast to prior work that focuses more on cultural objects (e.g. landmarks). The benchmark size (576 prompts), categorical coverage (9 super categories of activities) and regional coverage (16 countries) is of similar order to prior work [1-3, 5].
* The proposed AHeAD
## Major Weaknesses:
* I disagree with a central claim of the authors, i.e. that their benchmark "... capture the contextual complexity **missing from object-centric benchmarks**" (L74). Prior work has already evaluated on contextually-specific cultural activities, e.g. celebrations / festivals / performance arts [1-5], weddings [4], religious activities [1], sports [1, 2, 6], cooking [6]. The authors claim that their activity-focused benchmark highlights "**new failure patterns**" such as "ha
Taxonomy
Topics: Computational and Text Analysis Methods · Language and cultural evolution · Media Influence and Health
