GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models

Muhammad Atif Butt, Alexandra Gomez-Villa, Tao Wu, Javier Vazquez-Corral, Joost Van De Weijer, and Kai Wang

arXiv:2510.20586·cs.CV·October 24, 2025

Open Access · 4 Reviews

TL;DR

GenColorBench is a new benchmark for evaluating the color accuracy of text-to-image models, addressing a gap in existing assessments by focusing specifically on color precision and human perception alignment.

Contribution

It introduces the first comprehensive, color-focused benchmark grounded in established color systems, with extensive prompts and evaluations to measure models' color generation capabilities.

Findings

01

Models show varied performance in color accuracy.

02

The benchmark reveals specific color conventions models understand well.

03

Failure modes in color interpretation are identified.

Abstract

Recent years have seen impressive advances in text-to-image generation, with image generative or unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Thorough Data Curation: The paper is thorough in its creation of the benchmark dataset, drawing from established color systems and creating a large number of prompts (44,464).
- Perceptual Metric Choice: The choice to use CIELAB color space for evaluation is sound, as it is more perceptually uniform than RGB.
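
To make the reviewer's point about perceptual uniformity concrete, here is a minimal sketch of converting sRGB to CIELAB and measuring a color difference there. It assumes 8-bit sRGB input, the D65 white point, and the simple CIE76 difference (Euclidean distance in Lab); the names `srgb_to_lab` and `delta_e76` are illustrative, and nothing here is taken from the paper's own pipeline, which may use a different white point or difference formula.

```python
import math

# Assumption: D65 reference white (2-degree observer), scaled so that Y = 1.0.
WHITE_D65 = (0.95047, 1.0, 1.08883)


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB under D65."""

    def linearize(channel):
        # Undo the sRGB transfer curve to recover linear light.
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)

    # Linear sRGB -> CIE XYZ (standard sRGB/D65 matrix).
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        # Cube root with the linear segment CIELAB uses near black.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = (f(v / w) for v, w in zip((x, y, z), WHITE_D65))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two Lab triples."""
    return math.dist(lab1, lab2)


if __name__ == "__main__":
    target = srgb_to_lab((255, 0, 0))
    generated = srgb_to_lab((230, 20, 30))
    print(f"Delta E (CIE76): {delta_e76(target, generated):.2f}")
```

Because equal steps in Lab are meant to correspond more closely to equal perceived differences, a single distance threshold is more meaningful there than the same threshold applied to raw RGB values.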

Weaknesses

- Marginal/Trivial Contribution: The paper's core premise is flawed. It focuses on a niche, unimportant problem (hyper-specific color accuracy). This is a solved problem at a "good enough" level for most applications, and this benchmark does not measure any deeper semantic capability.
- Flawed Methodology: The evaluation pipeline is fundamentally unsound. It relies on a VQA model (Janus-1.3B) that the paper itself proves is unreliable (Table 2).
- Non-Transparent Pipeline: The methodology relie

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Addresses a Critical Gap: The paper tackles a well-motivated and highly important limitation in current T2I evaluation. Precise color control is a fundamental requirement for many practical applications, and this work provides the first systematic, large-scale tool to measure it.
- Theoretically Grounded Methodology: The benchmark's design is well-founded in color science. Grounding the evaluation in established, perceptually uniform color systems like ISCC-NBS and employing the 'dominant

Weaknesses

- Benchmark Calibration Concerns: The performance scores across all evaluated models are extremely low (highest average is 22.42%). Without a human performance baseline or inter-annotator agreement study, it is difficult to ascertain whether these scores reflect genuine, severe model limitations or overly stringent evaluation criteria. This lack of calibration makes the absolute scores hard to interpret.
- Arbitrary Thresholding in Evaluation Metric: The Just-Noticeable-Difference (JND) thre

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper presents a clear idea, addressing an interesting aspect of text-to-image evaluation: color understanding.
2. The writing is generally clear and structured, with sufficient methodological detail and logical flow.
3. The work offers a novel benchmark contribution, supported by sound experimental design and comprehensive analysis across multiple models.

Weaknesses

1. Each object is evaluated with only a single dominant color, which may oversimplify real-world cases where objects naturally exhibit multiple colors or textures.
2. There is some concern about the practical relevance of the benchmark: generative models may not need to distinguish over 400 colors, many of which are not practical or barely perceptible even to humans.
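
The first weakness can be illustrated with a small sketch of what reducing an object to one dominant color might look like. This is an assumption-laden illustration, not the paper's protocol: the function `dominant_color`, the coarse RGB binning, and the `bin_size` parameter are all hypothetical choices made here for brevity.

```python
from collections import defaultdict


def dominant_color(pixels, bin_size=32):
    """Return the mean RGB of the most populated coarse color bin.

    Hypothetical helper: groups pixels into bin_size-wide RGB cells and
    keeps only the largest cell, discarding every other color present.
    """
    if not pixels:
        raise ValueError("pixels must not be empty")

    buckets = defaultdict(list)
    for r, g, b in pixels:
        key = (r // bin_size, g // bin_size, b // bin_size)
        buckets[key].append((r, g, b))

    # The bin holding the most pixels is treated as the object's color.
    largest = max(buckets.values(), key=len)
    count = len(largest)
    return tuple(round(sum(p[i] for p in largest) / count) for i in range(3))


if __name__ == "__main__":
    # A two-tone object: 60% red pixels, 40% blue pixels.
    striped = [(200, 10, 10)] * 6 + [(10, 10, 200)] * 4
    print(dominant_color(striped))  # (200, 10, 10)
```

In this toy case the blue 40% of the object has no effect on the result, which is the kind of oversimplification the reviewer describes for multi-colored or textured objects.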

Reviewer 04 · Rating 6 · Confidence 3

Strengths

1. To the best of my knowledge, color evaluation is indeed an overlooked aspect in existing T2I benchmarks, and this work therefore fills an important gap.
2. The construction method, and especially the color identification protocol (lines 288-314), seems well-thought-out. However, as I am not an expert in color systems, I am not in a position to judge the reasonability, correctness, and professionalism of this specific design.
3. The benchmark covers multiple dimensions of evaluation, which I f

Weaknesses

1. I suggest designing a hierarchy of evaluation protocols with increasingly fine-grained color divisions. At a minimum, I would recommend adding a protocol that only involves ISCC-NBS Level 1 color names. The underlying rationale is that highly fine-grained color specification currently seems to be a niche demand, and such evaluation might be more relevant for specialized models. Subjecting general-purpose models to such strict criteria may not be necessary. A protocol with a coarser color divi

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Color perception and design · Computer Graphics and Visualization Techniques