TL;DR
T2I-ReasonBench is a benchmark for evaluating the reasoning abilities of text-to-image models across four dimensions (Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning), using a two-stage protocol that assesses reasoning accuracy and image quality.
Contribution
It introduces a novel benchmark with a structured evaluation protocol for reasoning in text-to-image models, covering four reasoning dimensions.
Findings
Benchmarking reveals varying reasoning capabilities among models.
Two-stage evaluation effectively separates reasoning accuracy from image quality.
The paper provides a detailed analysis of model performance across the different reasoning tasks.
Abstract
We propose T2I-ReasonBench, a benchmark evaluating the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark various T2I generation models and provide a comprehensive analysis of their performance.
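The two-stage protocol can be pictured roughly as follows. This is a minimal sketch and not the authors' implementation: the reviews below indicate that an LLM (DeepSeek-R1) generates question-criterion pairs and an MLLM (Qwen2.5-VL) scores the images, but the function names, the 0-to-1 score range, the unweighted mean, and the idiom example are all assumptions made for illustration. The three callables are stand-ins for the real models.

```python
from typing import Callable, Dict, List


def evaluate_image(
    prompt: str,
    image: object,
    make_criteria: Callable[[str], List[str]],
    judge_criterion: Callable[[object, str], float],
    judge_quality: Callable[[object], float],
) -> Dict[str, float]:
    """Hypothetical two-stage scoring of one generated image.

    Stage 1: an LLM turns the prompt into checkable questions.
    Stage 2: an MLLM answers each question for the image (score in [0, 1])
             and separately rates image quality.
    """
    questions = make_criteria(prompt)  # stage 1 (assumed interface)
    if not questions:
        raise ValueError("stage 1 produced no criteria for this prompt")

    # Stage 2: reasoning accuracy as an unweighted mean (an assumption).
    scores = [judge_criterion(image, q) for q in questions]
    return {
        "reasoning_accuracy": sum(scores) / len(scores),
        "image_quality": judge_quality(image),  # kept separate from accuracy
    }


# Stub models so the sketch runs without any real LLM or MLLM.
def fake_criteria(prompt: str) -> List[str]:
    return ["Is rain shown?", "Are literal cats or dogs absent?"]


def fake_judge(image: object, question: str) -> float:
    return 1.0 if "rain" in question else 0.0


def fake_quality(image: object) -> float:
    return 0.8


result = evaluate_image(
    "It's raining cats and dogs",  # illustrative idiom prompt, not from the benchmark
    image=None,
    make_criteria=fake_criteria,
    judge_criterion=fake_judge,
    judge_quality=fake_quality,
)
print(result)
```

Keeping the two scores in separate fields mirrors the stated goal of separating reasoning accuracy from image quality; how the benchmark actually aggregates them is not specified in this summary.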
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- It moves beyond compositionality and literal alignment toward reasoning-aware evaluation, an underexplored but crucial capability for T2I systems.
- It covers a wide range of models, including diffusion, unified, and proprietary systems.
- The work relies excessively on LLMs, yet lacks a thorough verification process for hallucinations or incorrect responses.
- Although several related metrics already exist, such as TIFA [1] and I-HallA [2], the paper does not compare against them.
- The paper attempts to tackle four major challenges at once, resulting in an unfocused contribution. Since the benchmark is not large-scale, it should have been carefully curated; however, it ends up being of ambiguous size and quali
1. This work addresses a critical gap by focusing on reasoning capabilities in T2I generation, an aspect that previous benchmarks largely ignored. By challenging models with idioms, complex design tasks, entity knowledge, and scientific scenarios, it goes beyond surface-level prompt-to-image alignment to evaluate deeper understanding and inference in image generation. The benchmark comprises 800 carefully curated prompts spanning four diverse reasoning dimensions. This thorough coverage ensures
1. The evaluation framework heavily relies on an LLM and a multimodal model as judges, which introduces potential bias and uncertainty in the scoring. Although the authors demonstrate that their metric aligns well with human evaluations, the dependence on AI evaluators (which have their own limitations) raises concerns about whether the scores always faithfully reflect human-perceived reasoning quality. The two-stage evaluation process is fairly complex and computationally intensive. It depends
- The paper addresses a timely and critical problem in generative AI: moving beyond surface-level text-image alignment to evaluate the deeper reasoning capabilities of T2I models. This is an important direction for the field.
- The proposed benchmark is reasonably comprehensive, with four distinct dimensions that probe different facets of reasoning, from figurative language (idioms) and creative planning (textual design) to world knowledge (entities) and physical principles (scientific reasoning
- Potential for evaluator bias: The framework uses Qwen2.5-VL as the automated scorer. Given that Qwen-Image is one of the top-performing open-source models under evaluation, this raises a serious concern about potential 'in-family' bias. The human correlation analysis, conducted on an unspecified subset of only 5 models, is not sufficient to rule out this potential bias across all 16 evaluated models. This concern undermines the reliability of the reported model rankings.
- Arbitrary evaluation
1. This work addresses the significant and under-explored problem of evaluating T2I models' reasoning capabilities, moving beyond existing benchmarks that focus on literal prompt-image alignment.
2. The benchmark introduces novel dimensions, "Idiom Interpretation" and "Textual Image Design," which challenge models with complex, abstract tasks that require inferring implicit information rather than just following explicit instructions.
3. Through a comprehensive evaluation of 16 SOTA models and a
1. The benchmark's core "AI-evaluating-AI" evaluation framework is a key weakness, as its reliability depends entirely on the AI models used for evaluation.
   a. If the criteria-generating LLM (DeepSeek-R1) itself possesses biases, knowledge gaps, or reasoning errors, it will produce flawed question-criterion pairs from the very start.
   b. This pipeline is susceptible to compounding errors, where any biases or misunderstandings from the LLM in the first stage are amplified by the MLLM's own limit
