TL;DR
This paper introduces M$^{3}$T2IBench, a comprehensive benchmark for evaluating text-to-image models on complex multi-instance prompts, along with a new alignment metric and a post-editing method to improve model performance.
Contribution
It presents a large-scale, multi-category benchmark with an object-detection-based metric and a training-free post-editing approach to enhance image-text alignment.
Findings
Current models perform poorly on the benchmark.
The Revise-Then-Enforce method improves alignment.
AlignScore correlates well with human judgment.
Abstract
Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, in particular overlooking the difficulty of prompts with multiple different instances belonging to the same category, or introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^{3}$T2IBench, a large-scale, multi-category, multi-instance, multi-relation benchmark, along with an object-detection-based evaluation metric, AlignScore, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This…
Peer Reviews
Decision·Submitted to ICLR 2026
The paper’s original contribution focuses on real-world scenarios, which other metrics lack. The prompt construction method is simple, general, and easily applicable across diffusion models, and it needs no retraining. The metrics are clearly defined, and human ratings align better with AlignScore than with alternatives. The paper is easy to read and the information flows logically. The paper’s contribution lies in a structured large-scale benchmark, making it valuable for real world u
The attributes used in the paper focus mostly on color. Adding further attributes such as shape would help the accuracy measure apply to real-world use cases beyond color and position. The DALLE-3 comparison uses only 100 prompts, which is relatively small. Revise-Then-Enforce appears to depend on correctly identifying the failed parts. It would help to explain the method used to identify the failed parts and turn them into prompts.
1. Significant Benchmark Scale and Complexity: It presents the largest T2I compositional alignment dataset to date (10k prompts) and supports long prompts, many relations, and multiple instances per category.
2. Fine-Grained and Structured Metric Design: The paper evaluates object counts, colors, and spatial relations via automated detectors, and distinguishes between bias (count errors) and accuracy (attributes/relations).
The benchmark depends heavily on automatic detectors for object, color, and relation evaluation. While this enables scalability, it also inherits every failure case of those detectors, particularly in stylized or abstract generations. The lack of reported human validation raises questions about the accuracy of the scores and whether detector errors might inflate failure rates. Another limitation is the synthetic nature of prompt construction. Prompts are generated using probabilistic rules
1. The paper proposes to evaluate both acc and bias, using an exhaustive search to determine the final acc. The idea of evaluating two metrics is reasonable.
2. The paper is well-written and easy to follow.
1. Lack of novelty of the proposed Revise-Then-Enforce method. First, the method appears to be hardly related to the proposed benchmark and scoring metrics. This splits the paper into two unrelated parts. Additionally, a similar idea has already been explored in earlier works, but the paper fails to mention [1].
2. In Line 201, the benchmark refrains from mentioning words like "fresh" and "majestically" to ensure accurate evaluation. This idea does not seem reasonable and therefore severely
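
The reviews above describe the metric as reporting a count-based bias plus an attribute/relation accuracy that is determined by exhaustive search. The paper's exact definitions are not reproduced on this page, so the following Python sketch is only an illustration under stated assumptions: bias is taken to be the absolute count error for one category, accuracy is taken to be the best fraction of color matches over all assignments of detections to prompted instances, and spatial relations are left out. The names `count_bias` and `best_match_accuracy` are hypothetical and do not come from the paper.

```python
from itertools import permutations


def count_bias(expected, detected):
    """Absolute count error for one category (assumed definition, not the paper's)."""
    return abs(len(detected) - len(expected))


def best_match_accuracy(expected, detected):
    """Exhaustively try assignments of detections to prompted instances.

    `expected` and `detected` are lists of color strings for a single
    object category. Returns the best fraction of prompted instances
    whose color is matched by the detection assigned to them.
    """
    if not expected:
        return 1.0
    # Pad with None so that a missing detection counts as a mismatch.
    padded = list(detected) + [None] * max(0, len(expected) - len(detected))
    best = 0.0
    for assignment in permutations(padded, len(expected)):
        hits = sum(1 for want, got in zip(expected, assignment) if got == want)
        best = max(best, hits / len(expected))
    return best


# Hypothetical example: the prompt asks for three cars, the detector finds two.
prompted_colors = ["red", "blue", "green"]
detected_colors = ["blue", "red"]
print(count_bias(prompted_colors, detected_colors))
print(best_match_accuracy(prompted_colors, detected_colors))
```

Trying every assignment is feasible only because a prompt contains a handful of instances per category; the number of assignments grows factorially, so a larger setting would call for an assignment solver instead.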
