When Cultures Meet: Multicultural Text-to-Image Generation
Parth Bhalerao, Mounika Yalamarty, Brian Trinh, Oana Ignat

TL;DR
This paper introduces a benchmark and dataset for multicultural text-to-image generation, analyzes model performance across cultural, demographic, and linguistic dimensions, and proposes a multi-agent framework to improve results.
Contribution
The paper presents the first benchmark for multicultural image generation, a new 9,000-image dataset, and MosAIG, a multi-agent framework that uses LLMs to strengthen cultural grounding in generated images.
Findings
Rich prompt composition improves image quality and cultural accuracy.
Models show disparities across languages and demographic groups.
The dataset and code are publicly released for further research.
Abstract
Text-to-image generation models have achieved strong performance in culturally homogeneous settings, yet their ability to generate multicultural scenes, where people and landmarks originate from different cultures, remains largely unexplored. We introduce multicultural text-to-image generation as a new task and present the first benchmark designed to study this setting. Our dataset contains 9,000 images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. Using this benchmark, we analyze the behavior of state-of-the-art text-to-image models across multiple dimensions, including alignment, image quality, aesthetics, knowledge, and fairness. As one strategy for composing cultural and demographic information, we explore MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural…
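To make the dataset dimensions concrete, the sketch below enumerates the full cross product of the attribute counts given in the abstract (five countries, three age groups, two genders, 25 landmarks, five languages). Every label is a hypothetical placeholder, since the abstract gives only the counts and not the values. The abstract also does not say how these dimensions combine into 9,000 images, so this illustrates the attribute grid only and makes no attempt to reproduce that figure.

```python
from itertools import product

# Placeholder labels: the abstract states only how many values each
# dimension has, not what the values are. All names are hypothetical.
COUNTRIES = [f"country_{i}" for i in range(1, 6)]       # five countries
AGE_GROUPS = [f"age_group_{i}" for i in range(1, 4)]    # three age groups
GENDERS = [f"gender_{i}" for i in range(1, 3)]          # two genders
LANDMARKS = [f"landmark_{i}" for i in range(1, 26)]     # 25 landmarks
LANGUAGES = [f"language_{i}" for i in range(1, 6)]      # five languages


def build_grid():
    """Return one record per combination of the five dimensions."""
    return [
        {
            "country": country,
            "age_group": age_group,
            "gender": gender,
            "landmark": landmark,
            "language": language,
        }
        for country, age_group, gender, landmark, language in product(
            COUNTRIES, AGE_GROUPS, GENDERS, LANDMARKS, LANGUAGES
        )
    ]


grid = build_grid()
print(len(grid))  # 5 * 3 * 2 * 25 * 5 = 3750 combinations
```

A grid like this is mainly useful for checking coverage: grouping evaluation scores by any one key (for example `language`) shows whether a model's results are balanced across that dimension.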
