DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li

TL;DR
DetailMaster introduces a new benchmark to evaluate text-to-image models' ability to handle long, complex prompts, revealing significant limitations in current models' compositional reasoning and attribute binding capabilities.
Contribution
This paper presents the first comprehensive benchmark specifically designed for assessing T2I models on long, detailed prompts. It defines the evaluation dimensions and open-sources the dataset and tools.
Findings
Current models achieve ~50% accuracy in key dimensions
Performance degrades with increasing prompt length
Fundamental limitations in compositional reasoning are identified
Abstract
While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations:…
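To make the reported quantities concrete, here is a minimal sketch of how per-dimension accuracy and accuracy by prompt length could be aggregated from per-check pass/fail judgments. It is not the authors' evaluation code: the record layout, the dimension keys, the function names, and the 100-token bucket size are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical keys for the four evaluation dimensions named in the abstract.
DIMENSIONS = (
    "character_attributes",
    "character_locations",
    "scene_attributes",
    "spatial_interactive_relationships",
)


def accuracy_by_dimension(records):
    """Aggregate (dimension, passed) pairs into {dimension: accuracy}.

    Each record is one hypothetical check, e.g. "is the dog brown?",
    together with a boolean verdict from some evaluator.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, passed in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        totals[dimension] += 1
        hits[dimension] += int(passed)
    return {d: hits[d] / totals[d] for d in totals}


def accuracy_by_length_bucket(records, bucket_size=100):
    """Aggregate (token_count, passed) pairs into {bucket_start: accuracy}.

    Buckets are [0, bucket_size), [bucket_size, 2 * bucket_size), and so on;
    the bucket size is an arbitrary choice for this sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for token_count, passed in records:
        bucket = (token_count // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(passed)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Comparing the values returned by `accuracy_by_length_bucket` across buckets is one simple way to check whether accuracy falls as prompts get longer.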
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* Data curation pipeline
* Analysis of limitations of current benchmarks pertaining to long prompts
* Robustness and validity of the benchmark through human evaluation

* Lack of controlled experiments to drive the insights and analyses in Section 4. It would have been much better to take a single model architecture and ablate it under different setups to drive the insights. More on this in "Questions".
* Lack of results with several models that are known for handling long and complex prompts (such as SANA [1], Lumina-Next [2], and QwenImage [3]).
* It's said in the paper multiple times (L39, for example) that T2I models are trained on short-length prompts. How
1. The consistency of text and images in complex long prompts is crucial for evaluating the capabilities of T2I models, and existing benchmarks are indeed lacking in this aspect;
2. The benchmark synthesis process in the paper comprehensively considers various aspects under long text prompts, such as Character Attributes, Structured Character Locations, and Multi-Dimensional Scene Attributes;
3. The paper conducts extensive experiments on existing open-source and closed-source models, indicating
1. Recent diffusion models that use an MLLM as a text encoder, such as Hunyuan Image 3.0 and Qwen-Image, possess stronger text understanding capabilities. How do these models perform on DetailMaster?
2. During evaluation, DetailMaster needs to detect the bounding box for each character based on the Character List. How does it handle cases where the prompt contains multiple repeated characters and there are interactions between these repeated characters?
3. Due to the inherent hallucinations of LLMs
1. The paper proposes a comprehensive compositional dataset of long, complex prompts.
2. The paper is well written and easy to follow.
3. The experiments are extensive.

1. The paper lacks discussion of ConceptMix, which targets compositional T2I generation.
2. The attribute pipeline relies on an MLLM (e.g., using an MLLM to identify background composition, lighting conditions, and stylistic elements), which may introduce hallucinations or mistakes. Using MLLMs as evaluators may still introduce problems, though the authors tried to mitigate them. For example, the evaluation results are not easy to reproduce.

[A] Wu X, Yu D, Huang Y, et al. Conceptmix: A compositional im
