Color Bind: Exploring Color Perception in Text-to-Image Models
Shay Shomer-Chai, Wenxuan Peng, Bharath Hariharan, Hadar Averbuch-Elor

TL;DR
This paper investigates how well text-to-image models capture color semantics in prompts, revealing limitations and proposing a new image editing method to improve multi-color object alignment.
Contribution
It provides a detailed case study on color perception in text-to-image models and introduces a novel editing technique to enhance multi-object color accuracy.
Findings
Pretrained models struggle with multi-color prompts.
Existing inference-time and editing methods are insufficient.
Proposed editing technique significantly improves color alignment.
Abstract
Text-to-image generation has recently seen remarkable success, granting users the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct at a larger scale. In this work, we perform a case study on colors -- a fundamental attribute commonly associated with objects in text prompts, which offer a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images…
Taxonomy
Topics: Color perception and design
