Culture in Action: Evaluating Text-to-Image Models through Social Activities

Sina Malakouti; Boqing Gong; Adriana Kovashka

arXiv:2511.05681·cs.CV·March 9, 2026



TL;DR

This paper introduces CULTIVate, a comprehensive benchmark for evaluating text-to-image models on their ability to accurately depict cross-cultural social activities, highlighting disparities and biases in current models.

Contribution

The paper presents CULTIVate, a new benchmark with metrics for assessing cultural faithfulness in T2I models across diverse social activities and regions, addressing gaps in existing evaluations.

Findings

01

Models perform better for Global North countries.

02

Systematic disparities and failure modes identified.

03

Metrics correlate strongly with human judgments.
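To make the third finding concrete, here is a minimal sketch of how agreement between an automatic metric and human ratings can be checked with Spearman rank correlation. The summary does not say which correlation statistic the authors report, so the choice of Spearman is an assumption, and the function names and the numbers in the example are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-image scores; these numbers are made up for illustration.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
human_ratings = [1, 2, 3, 5, 4]
rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based statistic is a reasonable default here because human ratings are usually ordinal, so only the ordering of images matters, not the spacing between scores.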

Abstract

Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 5

Strengths

1. The authors introduce new metrics to measure how well T2I models depict cultural activities. As T2I models are being deployed in many parts of the world, this is a very important problem that needs to be tackled, and reliable metrics are essential.
2. The idea behind the metrics is well motivated. I agree that existing works mostly quantify alignment and quality metrics and might miss some nuances. Going beyond these and explicitly calculating hallucination and exaggeration is a nice directio

Weaknesses

1. **CULTIVate dataset:** The authors introduce a large-scale dataset comprising 576 prompts and 19k images, but it is not clear what the utility of the dataset is beyond reporting the AHEaD metrics on them. Many of these image-prompt pairs have no human annotations against which the metrics can be correlated. Could the authors provide more insights on this?
2. **Human Study:** The authors have conducted a very small human study with a limited number of prompts, i.e., only 9 activities per countr

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1) Interpretable diagnostics: the top and bottom descriptors per country/activity can be listed to show what is missing or overdone, which is useful for model iteration.
2) No per-image human labels are needed to compute the metrics, and the approach shows good improvements over existing metrics.

Weaknesses

1) The benchmark spans 16 countries, but the human study covers only a smaller subset (7 countries), with no clear rationale or GN/GS/per-country annotation breakdown.
2) Descriptors are produced with GPT-4o and Gemini-2.5 Flash, yet the “MLLM-as-judge” baselines cover only InternVL/QwenVL, with no strong judge like Gemini 2.5 Flash/Pro or GPT-4o-vision. This makes it hard to evaluate whether different descriptors are needed and to justify the need for more than one judge model.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

* This work contributes to cultural benchmarking of text-to-image models, which is a relatively underexplored and important area of research.
* They provide a new benchmark that is focused on cultural activities, in contrast to prior work that focuses more on cultural objects (e.g. landmarks). The benchmark size (576 prompts), categorical coverage (9 super categories of activities) and regional coverage (16 countries) are of similar order to prior work [1-3, 5].
* The proposed AHeAD

Weaknesses

## Major Weaknesses:

* I disagree with a central claim of the authors, i.e. that their benchmark "... capture the contextual complexity **missing from object-centric benchmarks**" (L74). Prior work has already evaluated on contextually-specific cultural activities, e.g. celebrations / festivals / performance arts [1-5], weddings [4], religious activities [1], sports [1, 2, 6], cooking [6]. The authors claim that their activity-focused benchmark highlights "**new failure patterns**" such as "ha


Taxonomy

Topics: Computational and Text Analysis Methods · Language and cultural evolution · Media Influence and Health