DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

Qirui Jiao; Daoyuan Chen; Yilun Huang; Xika Lin; Ying Shen; Yaliang Li

arXiv:2505.16915·cs.CV·October 14, 2025

TL;DR

DetailMaster introduces a new benchmark to evaluate text-to-image models' ability to handle long, complex prompts, revealing significant limitations in current models' compositional reasoning and attribute binding capabilities.

Contribution

This paper presents the first comprehensive benchmark specifically designed for assessing T2I models on long, detailed prompts. It defines four evaluation dimensions and open-sources the dataset and tools.

Findings

01

Current models achieve ~50% accuracy in key dimensions

02

Performance degrades with increasing prompt length

03

Fundamental limitations in compositional reasoning are identified

Abstract

While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations:…
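
As a rough illustration of how results over the four evaluation dimensions named in the abstract could be summarized, the sketch below computes per-dimension accuracy from a list of binary judgments. The function name `accuracy_by_dimension`, the `(dimension, is_correct)` input format, and the simple hit-rate metric are assumptions made for this example; they are not taken from the paper or its released tools.

```python
from collections import defaultdict

# Dimension names as listed in the abstract.
DIMENSIONS = (
    "Character Attributes",
    "Structured Character Locations",
    "Multi-Dimensional Scene Attributes",
    "Spatial/Interactive Relationships",
)


def accuracy_by_dimension(judgments):
    """Return the fraction of correct judgments for each dimension.

    `judgments` is an iterable of (dimension, is_correct) pairs. This input
    format is hypothetical and chosen only for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, is_correct in judgments:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        totals[dimension] += 1
        hits[dimension] += int(bool(is_correct))
    # Dimensions with no judgments are omitted rather than reported as 0.
    return {d: hits[d] / totals[d] for d in DIMENSIONS if totals[d]}


if __name__ == "__main__":
    sample = [
        ("Character Attributes", True),
        ("Character Attributes", False),
        ("Spatial/Interactive Relationships", True),
    ]
    for name, acc in accuracy_by_dimension(sample).items():
        print(f"{name}: {acc:.2f}")
```

Reporting each dimension separately, rather than a single pooled score, keeps a weakness in one dimension from being hidden by strength in another.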

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 5

Strengths

* Data curation pipeline
* Analysis of limitations of current benchmarks pertaining to long prompts
* Robustness and validity of the benchmark through human evaluation

Weaknesses

* Lack of controlled experiments to drive the insights and analyses in Section 4. It would have been much better to take a single model architecture and ablate it under different setups to drive the insights. More on this in "Questions".
* Lack of results with several models that are known for handling long and complex prompts (such as SANA [1], Lumina-Next [2], and QwenImage [3]).
* It's said in the paper multiple times (L39, for example) that T2I models are trained on short-length prompts. How

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The consistency of text and images in complex long prompts is crucial for evaluating the capabilities of T2I models, and existing benchmarks are indeed lacking in this aspect;
2. The benchmark synthesis process in the paper comprehensively considers various aspects under long text prompts, such as Character Attributes, Structured Character Locations, and Multi-Dimensional Scene Attributes;
3. The paper conducts extensive experiments on existing open-source and closed-source models, indicating

Weaknesses

1. Recent diffusion models that use an MLLM as a text encoder, such as Hunyuan Image 3.0 and Qwen-Image, possess stronger text understanding capabilities. How do these models perform on DetailMaster?
2. During evaluation, DetailMaster needs to detect the bounding box for each character based on the Character List. How does it handle cases when the prompt contains multiple repeated characters and there are interactions between these repeated characters?
3. Due to the inherent hallucinations of LLMs

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper proposes a comprehensive compositional dataset of long, complex prompts.
2. The paper is well written and easy to follow.
3. The experiments are extensive.

Weaknesses

1. The paper lacks discussion of ConceptMix, which targets compositional T2I generation.
2. The attribute pipeline relies on an MLLM (e.g., using an MLLM to identify background composition, lighting conditions, and stylistic elements), which may introduce hallucinations or mistakes. Using MLLMs as evaluators may still introduce problems, although the authors tried to mitigate them. For example, the evaluation results are not easy to reproduce.

[A] Wu X, Yu D, Huang Y, et al. Conceptmix: A compositional im

Code & Models

Repositories

modelscope/data-juicer
pytorch · Official

Datasets

datajuicer/DetailMaster
dataset · 37 dl
