When Cultures Meet: Multicultural Text-to-Image Generation

Parth Bhalerao; Mounika Yalamarty; Brian Trinh; Oana Ignat

arXiv:2502.15972·cs.CV·April 20, 2026

TL;DR

This paper introduces a new benchmark and dataset for multicultural text-to-image generation, analyzing model performance across cultural, demographic, and linguistic dimensions, and proposing a multi-agent framework to improve results.

Contribution

It presents the first benchmark for multicultural image generation, a new dataset, and a multi-agent framework leveraging LLMs to enhance cultural grounding in generated images.

Findings

01  Rich prompt composition improves image quality and cultural accuracy.

02  Models show disparities across languages and demographic groups.

03  The dataset and code are publicly released for further research.

Abstract

Text-to-image generation models have achieved strong performance in culturally homogeneous settings, yet their ability to generate multicultural scenes, where people and landmarks originate from different cultures, remains largely unexplored. We introduce multicultural text-to-image generation as a new task and present the first benchmark designed to study this setting. Our dataset contains 9,000 images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. Using this benchmark, we analyze the behavior of state-of-the-art text-to-image models across multiple dimensions, including alignment, image quality, aesthetics, knowledge, and fairness. As one strategy for composing cultural and demographic information, we explore MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural…
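To make the task definition concrete, the following is a minimal sketch of how the attribute space described in the abstract could be enumerated, keeping only combinations where the person and the landmark come from different countries. Everything named here is an assumption for illustration: the abstract gives only the counts (five countries, three age groups, two genders, 25 landmarks, five languages), so the labels are placeholders, and the even split of five landmarks per country is a guess. The sketch does not attempt to reproduce the 9,000-image total, whose derivation the abstract does not state.

```python
from itertools import product

# Placeholder labels: the abstract reports only how many values each
# attribute has, not what the values are.
COUNTRIES = [f"country_{i}" for i in range(1, 6)]
AGE_GROUPS = [f"age_group_{i}" for i in range(1, 4)]
GENDERS = [f"gender_{i}" for i in range(1, 3)]
LANGUAGES = [f"language_{i}" for i in range(1, 6)]

# Assumption (not stated in the abstract): the 25 landmarks are split
# evenly, five per country. Each entry is (landmark_name, home_country).
LANDMARKS = [(f"landmark_{c}_{j}", c) for c in COUNTRIES for j in range(1, 6)]


def multicultural_combinations():
    """Yield attribute combinations where person and landmark differ in country."""
    for country, age, gender, (landmark, landmark_country) in product(
        COUNTRIES, AGE_GROUPS, GENDERS, LANDMARKS
    ):
        if country != landmark_country:
            yield {
                "person_country": country,
                "age_group": age,
                "gender": gender,
                "landmark": landmark,
                "landmark_country": landmark_country,
            }


combos = list(multicultural_combinations())
# Under the even-split assumption: 5 countries * 3 ages * 2 genders * 20
# foreign landmarks = 600 combinations, or 3,000 across five languages.
print(len(combos), len(combos) * len(LANGUAGES))
```

Filtering on `person_country != landmark_country` is what distinguishes a multicultural scene from a culturally homogeneous one in this sketch; dropping the filter would give all 750 person-landmark combinations.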

Code & Models

Datasets

AIM-SCU/Multi-Cultural-Single-Multi-Agent-Images
dataset · 98 dl