GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models
Muhammad Atif Butt, Alexandra Gomez-Villa, Tao Wu, Javier Vazquez-Corral, Joost Van De Weijer, and Kai Wang

TL;DR
GenColorBench is a new benchmark for evaluating the color accuracy of text-to-image models, addressing a gap in existing assessments by focusing specifically on color precision and human perception alignment.
Contribution
It introduces the first comprehensive, color-focused benchmark grounded in established color systems, with extensive prompts and evaluations to measure models' color generation capabilities.
Findings
Models show varied performance in color accuracy.
The benchmark reveals specific color conventions models understand well.
Failure modes in color interpretation are identified.
Abstract
Recent years have seen impressive advances in text-to-image generation, with image generative or unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Thorough Data Curation: The paper is thorough in its creation of the benchmark dataset, drawing from established color systems and creating a large number of prompts (44,464). - Perceptual Metric Choice: The choice to use CIELAB color space for evaluation is sound, as it is more perceptually uniform than RGB.
- Marginal/Trivial Contribution: The paper's core premise is flawed. It focuses on a niche, unimportant problem (hyper-specific color accuracy). This is a solved problem at a "good enough" level for most applications, and this benchmark does not measure any deeper semantic capability. - Flawed Methodology: The evaluation pipeline is fundamentally unsound. It relies on a VQA model (Janus-1.3B) that the paper itself proves is unreliable (Table 2). - Non-Transparent Pipeline: The methodology relie
- Addresses a Critical Gap: The paper tackles a well-motivated and highly important limitation in current T2I evaluation. Precise color control is a fundamental requirement for many practical applications, and this work provides the first systematic, large-scale tool to measure it. - Theoretically Grounded Methodology: The benchmark's design is well-founded in color science. Grounding the evaluation in established, perceptually uniform color systems like ISCC-NBS and employing the 'dominant
- Benchmark Calibration Concerns: The performance scores across all evaluated models are extremely low (highest average is 22.42%). Without a human performance baseline or inter-annotator agreement study, it is difficult to ascertain whether these scores reflect genuine, severe model limitations or overly stringent evaluation criteria. This lack of calibration makes the absolute scores hard to interpret. - Arbitrary Thresholding in Evaluation Metric: The Just-Noticeable-Difference (JND) thre
1. The paper presents a clear idea, addressing an interesting aspect of text-to-image evaluation: color understanding. 2. The writing is generally clear and structured, with sufficient methodological detail and logical flow. 3. The work offers a novel benchmark contribution, supported by sound experimental design and comprehensive analysis across multiple models.
1. Each object is evaluated with only a single dominant color, which may oversimplify real-world cases where objects naturally exhibit multiple colors or textures. 2. There is some concern about the practical relevance of the benchmark—generative models may not need to distinguish over 400 colors, many of which are not practical or barely perceptible even to humans.
1. To the best of my knowledge, color evaluation is indeed an overlooked aspect in existing T2I benchmarks, and this work therefore fills an important gap. 2. The construction method, and especially the color identification protocol (lines 288-314), seems well-thought-out. However, as I am not an expert in color systems, I am not in a position to judge the reasonability, correctness, and professionalism of this specific design. 3. The benchmark covers multiple dimensions of evaluation, which I f
1. I suggest designing a hierarchy of evaluation protocols with increasingly fine-grained color divisions. At a minimum, I would recommend adding a protocol that only involves ISCC-NBS Level 1 color names. The underlying rationale is that highly fine-grained color specification currently seems to be a niche demand, and such evaluation might be more relevant for specialized models. Subjecting general-purpose models to such strict criteria may not be necessary. A protocol with a coarser color divi
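Several of the reviews above refer to measuring color differences in CIELAB and comparing them against a Just-Noticeable-Difference (JND) threshold. The Python sketch below illustrates what such a check can look like. It is not the paper's pipeline: it assumes sRGB inputs with a D65 white point, swaps in the simple CIE76 distance (Euclidean distance in CIELAB), and the function names and the threshold value of 5.0 are hypothetical choices made for illustration only.

```python
import math

# D65 reference white (CIE 1931 2-degree observer), scaled so that Y = 1.0.
WHITE_D65 = (0.95047, 1.0, 1.08883)


def srgb_channel_to_linear(value):
    """Undo the sRGB transfer curve for one 8-bit channel (0-255)."""
    c = value / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (L*, a*, b*) under D65."""
    r, g, b = (srgb_channel_to_linear(v) for v in rgb)

    # Linear sRGB -> CIE XYZ (standard sRGB/D65 matrix).
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        # Piecewise CIELAB compression function.
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    fx = f(x / WHITE_D65[0])
    fy = f(y / WHITE_D65[1])
    fz = f(z / WHITE_D65[2])
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(lab1, lab2)


def color_matches(target_rgb, generated_rgb, threshold=5.0):
    """Return True if the two colors differ by at most `threshold`.

    The default threshold is a hypothetical value, not the one used
    by the benchmark.
    """
    diff = delta_e76(srgb_to_lab(target_rgb), srgb_to_lab(generated_rgb))
    return diff <= threshold


if __name__ == "__main__":
    target = (255, 0, 0)       # color requested in the prompt
    generated = (250, 10, 5)   # e.g. a dominant color measured in the image
    print(delta_e76(srgb_to_lab(target), srgb_to_lab(generated)))
    print(color_matches(target, generated))
```

Because the pass/fail decision depends entirely on `threshold`, the same generated images can score very differently under a stricter or looser cutoff, which is the calibration concern raised in the reviews.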
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Color perception and design · Computer Graphics and Visualization Techniques
