M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark

Huixuan Zhang; Xiaojun Wan

arXiv:2510.23020·cs.CV·October 28, 2025

TL;DR

This paper introduces M$^{3}$T2IBench, a comprehensive benchmark for evaluating text-to-image models on complex multi-instance prompts, along with a new alignment metric and a post-editing method to improve model performance.

Contribution

It presents a large-scale, multi-category benchmark with an object-detection-based metric and a training-free post-editing approach to enhance image-text alignment.

Findings

01

Current models perform poorly on the benchmark.

02

The Revise-Then-Enforce method improves alignment.

03

AlignScore correlates well with human judgment.

Abstract

Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^{3}$T2IBench, a large-scale, multi-category, multi-instance, multi-relation benchmark, along with an object-detection-based evaluation metric, AlignScore, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The paper’s original contribution focuses on real-world scenarios, which other metrics do not cover. The prompt construction method is simple, general, and easily applicable across diffusion models, and it needs no retraining. The metrics are defined clearly, and human ratings align better with AlignScore than with the alternatives. The paper is easy to read and the information flows logically. The paper’s contribution lies in coming up with a structured large-scale benchmark making it valuable for real world u

Weaknesses

The attributes used in the paper are mostly limited to color. Adding further attributes, such as shape, would make the accuracy measure applicable to real-world use cases beyond color and position. The DALL-E 3 comparison uses only 100 prompts, which is relatively small. Revise-Then-Enforce appears to depend on correctly identifying the failed parts of an image. It would help to explain how the failed parts are identified and turned into prompts.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Significant Benchmark Scale and Complexity: It presents the largest T2I compositional alignment dataset to date (10k prompts) and supports long prompts, many relations, and multiple instances per category.
2. Fine-Grained and Structured Metric Design: The paper evaluates object counts, colors, and spatial relations via automated detectors, and distinguishes between bias (count errors) and accuracy (attributes/relations).

Weaknesses

The benchmark depends heavily on automatic detectors for object, color, and relation evaluation. While this enables scalability, it also inherits all failure cases of those detectors, particularly in stylized or abstract generations. The lack of reported human validation raises questions about the accuracy of the scores and whether detector errors might inflate failure rates. Another limitation is the synthetic nature of prompt construction. Prompts are generated using probabilistic rules

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper proposes to evaluate both acc and bias, with an exhaustive search method to determine the final acc. The idea of evaluating two metrics is reasonable.
2. The paper is well-written and easy to follow.

Weaknesses

1. Lack of novelty in the proposed Revise-Then-Enforce method. First, the method appears to be hardly related to the proposed benchmark and scoring metrics, which splits the paper into two unrelated parts. Additionally, a similar idea has already been explored in earlier works, but the paper fails to mention [1].
2. In Line 201, the benchmark refrains from mentioning words like fresh and majestically to ensure accurate evaluation. This idea does not seem reasonable and therefore severely
