CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Shravan Nayak; Mehar Bhatia; Xiaofeng Zhang; Verena Rieser; Lisa Anne Hendricks; Sjoerd van Steenkiste; Yash Goyal; Karolina Stańczak; Aishwarya Agrawal

arXiv:2506.08835 · cs.CV · January 21, 2026

TL;DR

This paper introduces CulturalFrames, a benchmark for evaluating how well text-to-image models and their metrics align with diverse cultural expectations, revealing significant gaps and poor correlation with human judgments.

Contribution

It presents the first systematic assessment of cultural expectation alignment in T2I models and evaluation metrics using a comprehensive human-annotated benchmark across multiple countries.

Findings

1. Cultural expectations are missed 44% of the time across models and countries.
2. Explicit cultural expectations are missed 68% of the time.
3. Existing evaluation metrics correlate poorly with human judgments.
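
To make the last finding concrete, here is a minimal sketch of one common way to check how well an automatic metric tracks human judgments: Spearman's rank correlation. The summary above does not say which statistic the paper uses, so the choice of Spearman is an assumption. The names `human_ratings` and `metric_scores` and all the numbers are hypothetical illustrations, not data from CulturalFrames.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-image scores (illustrative only, not from the benchmark).
human_ratings = [5, 4, 2, 1, 3]                 # e.g. annotator ratings on a 1-5 scale
metric_scores = [0.31, 0.29, 0.30, 0.27, 0.33]  # e.g. an automatic alignment metric

rho = spearman(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rho near 1 would mean the metric orders images the way annotators do. Values near 0 indicate the weak agreement that the finding describes.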

Abstract

The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts, where missed cues can stereotype communities and undermine usability. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with both explicit (stated) and implicit (unstated, implied by the prompt's cultural context) cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that across models and countries,…

Code & Models

Datasets

mair-lab/CulturalFrames (dataset, 109 downloads)

Videos

CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Taxonomy

Topics: Innovative Human-Technology Interaction · Multimodal Machine Learning Applications · Data Visualization and Analytics