T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation

Kaiyue Sun; Rongyao Fang; Chengqi Duan; Xian Liu; Xihui Liu

arXiv:2508.17472·cs.CV·August 26, 2025

TL;DR

T2I-ReasonBench is a comprehensive benchmark designed to evaluate the reasoning abilities of text-to-image models across four key dimensions, using a two-stage protocol to assess reasoning accuracy and image quality.

Contribution

It introduces a novel benchmark with a structured evaluation protocol for reasoning in text-to-image models, covering four reasoning dimensions.

Findings

01

Benchmarking reveals varying reasoning capabilities among models.

02

Two-stage evaluation effectively separates reasoning accuracy from image quality.

03

Provides detailed analysis of model performances across different reasoning tasks.

Abstract

We propose T2I-ReasonBench, a benchmark evaluating the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark various T2I generation models and provide a comprehensive analysis of their performance.
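The two-stage protocol can be pictured with a minimal sketch. According to the reviews further down, the first stage uses an LLM to produce question-criterion pairs and the second uses a multimodal model to score images against them. Everything else here is an assumption made for illustration: the names `Criterion`, `stage_one`, and `stage_two`, the stub generator and judge, the example idiom, and the use of a plain average to combine scores. Image-quality scoring is left out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    question: str  # what the judge should check in the image
    expected: str  # what a correct image should show


# Hypothetical judge signature: (image reference, criterion) -> score in [0, 1].
Judge = Callable[[str, Criterion], float]


def stage_one(prompt: str, generate: Callable[[str], List[Criterion]]) -> List[Criterion]:
    """Stage 1: derive question-criterion pairs from the prompt alone."""
    criteria = generate(prompt)
    if not criteria:
        raise ValueError("no criteria generated for prompt")
    return criteria


def stage_two(image: str, criteria: List[Criterion], judge: Judge) -> float:
    """Stage 2: score the image on each criterion; averaging is an assumption."""
    return mean(judge(image, c) for c in criteria)


# Stubs standing in for the LLM and the multimodal judge (not real model calls).
def fake_generate(prompt: str) -> List[Criterion]:
    return [
        Criterion("Is someone shown working at night?", "yes"),
        Criterion("Is the idiom shown figuratively?", "yes"),
    ]


FAKE_SCORES = {
    "Is someone shown working at night?": 1.0,
    "Is the idiom shown figuratively?": 0.5,
}


def fake_judge(image: str, criterion: Criterion) -> float:
    return FAKE_SCORES[criterion.question]


criteria = stage_one("burning the midnight oil", fake_generate)
score = stage_two("image_001.png", criteria, fake_judge)
print(score)
```

Keeping the two stages as separate functions makes the reviewers' concern easy to see: any error in the generated criteria is passed unchanged into the scoring step.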

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 5

Strengths

- It moves beyond compositionality and literal alignment toward reasoning-aware evaluation, an underexplored but crucial capability for T2I systems.
- It covers a wide range of models, including diffusion, unified, and proprietary systems.

Weaknesses

- The work relies excessively on LLMs, yet lacks a thorough verification process for hallucinations or incorrect responses.
- Although several other related metrics already exist, such as TIFA [1] and I-HallA [2], the paper does not provide any comparison with them.
- The paper attempts to tackle four major challenges at once, resulting in an unfocused contribution. Since the benchmark is not large-scale, it should have been carefully curated; however, it ends up being of ambiguous size and quali

Reviewer 02·Rating 6·Confidence 4

Strengths

1. This work addresses a critical gap by focusing on reasoning capabilities in T2I generation, an aspect that previous benchmarks largely ignored. By challenging models with idioms, complex design tasks, entity knowledge, and scientific scenarios, it goes beyond surface-level prompt-to-image alignment to evaluate deeper understanding and inference in image generation. The benchmark comprises 800 carefully curated prompts spanning four diverse reasoning dimensions. This thorough coverage ensures

Weaknesses

1. The evaluation framework heavily relies on an LLM and a multimodal model as judges, which introduces potential bias and uncertainty in the scoring. Although the authors demonstrate that their metric aligns well with human evaluations, the dependence on AI evaluators (which have their own limitations) raises concerns about whether the scores always faithfully reflect human-perceived reasoning quality. The two-stage evaluation process is fairly complex and computationally intensive. It depends

Reviewer 03·Rating 4·Confidence 4

Strengths

- The paper addresses a timely and critical problem in generative AI: moving beyond surface-level text-image alignment to evaluate the deeper reasoning capabilities of T2I models. This is an important direction for the field.
- The proposed benchmark is reasonably comprehensive, with four distinct dimensions that probe different facets of reasoning, from figurative language (idioms) and creative planning (textual design) to world knowledge (entities) and physical principles (scientific reasoning

Weaknesses

- Potential for evaluator bias: The framework uses Qwen2.5-VL as the automated scorer. Given that Qwen-Image is one of the top-performing open-source models under evaluation, this raises a serious concern about potential 'in-family' bias. The human correlation analysis, conducted on an unspecified subset of only 5 models, is not sufficient to rule out this potential bias across all 16 evaluated models. This concern undermines the reliability of the reported model rankings.
- Arbitrary evaluation

Reviewer 04·Rating 4·Confidence 3

Strengths

1. This work addresses the significant and under-explored problem of evaluating T2I models' reasoning capabilities, moving beyond existing benchmarks that focus on literal prompt-image alignment.
2. The benchmark introduces novel dimensions, "Idiom Interpretation" and "Textual Image Design," which challenge models with complex, abstract tasks that require inferring implicit information rather than just following explicit instructions.
3. Through a comprehensive evaluation of 16 SOTA models and a

Weaknesses

1. The benchmark's core "AI-evaluating-AI" evaluation framework is a key weakness, as its reliability depends entirely on the AI models used for evaluation.
   a. If the criteria-generating LLM (DeepSeek-R1) itself possesses biases, knowledge gaps, or reasoning errors, it will produce flawed question-criterion pairs from the very start.
   b. This pipeline is susceptible to compounding errors, where any biases or misunderstandings from the LLM in the first stage are amplified by the MLLM's own limit
