Why Settle for One? Text-to-ImageSet Generation and Evaluation
Chengyou Jia, Xin Shen, Zhuohang Dang, Changliang Xia, Weijia Wu, Xinyu Zhang, Hangwei Qian, Ivor W. Tsang, Minnan Luo

TL;DR
This paper introduces a new task called Text-to-ImageSet generation, along with a benchmark, evaluation framework, and a training-free method that leverages pretrained models to generate diverse, consistent image sets based on complex user instructions.
Contribution
The paper proposes the first comprehensive framework for Text-to-ImageSet generation, including a benchmark, evaluation metrics, and a novel zero-shot method leveraging pretrained diffusion transformers.
Findings
AutoT2IS outperforms existing methods on T2IS-Bench.
Diverse consistency requirements challenge current models.
AutoT2IS enables practical applications in real-world scenarios.
Abstract
Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce T2IS-Bench with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose T2IS-Eval, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective…
Peer Reviews
Decision · ICLR 2026 Conference Desk Rejected Submission
- AutoT2IS achieves significant improvements over existing methods through a simple yet effective divide-and-conquer strategy that maximally leverages DiT's in-context capabilities.
- Comprehensive experiments demonstrating competitive or superior performance compared to both generalized approaches and even specialized methods.
- The presentation is clear and easy to follow.
- It seems the size of the image set (n) ranges from 2 to 5. The paper does not discuss the effect of this size number. For example, how will the size affect the quality and consistency of the images? What will happen when the size is extended to even larger numbers, such as 8 or 10?
- In addition to text-to-image generation works, the authors should also carefully discuss another line of research, i.e., interleaved text-and-image generation, where the MLLM is also tasked with generating a seque
1. This paper is well-written and very easy to follow.
2. This paper proposes a novel task: text to image set. This task has application value and is very important for future image generation research. Besides, the newly designed evaluation metrics are also reasonable.
3. This paper also proposes a training-free framework for this task, which is novel and promising.
4. Comprehensive evaluation proves this benchmark is important, revealing some limitations of current open-/closed-source models, a
From my perspective, as a benchmark paper, this is enough. But I still have several questions:
1. How do you ensure evaluation consistency? The evaluation metrics are VLLM-based. It would be better to show that your proposed evaluation metrics are reasonable. For example, you could run a human evaluation on a subset and analyze the agreement between your metrics and the human evaluation scores.
2. It seems that the proposed framework performs worse than baselines in some cases in Table 2 (e.g., Style Desig
- The paper proposes a task of Text-to-ImageSet generation, which generates multiple images instead of single images.
- A benchmark and evaluation framework are proposed for this task.
- A training-free method is proposed that leverages LLMs for structured recaption and a set-aware generation strategy.
- The primary concern is the framing of the T2IS problem itself. The paper aggregates several pre-existing, distinct research tasks (e.g., 'Character Generation', 'Story Generation', 'Process Generation') under a new umbrella, as shown in Table 2. **It is not clearly articulated what new research challenge is unlocked by this aggregation.** The contribution appears to be more of a unified testbed rather than a novel research problem, which raises questions about the work's fundamental researc
* The paper proposes a fresh and interesting idea: generating sets of images that maintain both variety and consistency. This direction extends traditional text-to-image generation toward more general and realistic applications.
* The method is described clearly, and the proposed approach that combines prompt concatenation and masked latent operations is creative and well-motivated.
* The experiments are comprehensive, covering different domains and including strong comparisons with both open-so
* The paper could better explain the position and purpose of T2IS-Bench relative to existing datasets. Since the benchmark focuses on various types of visual consistency (such as identity, style, and logic), it would be helpful to clarify why combining existing datasets, such as those for multi-view or character generation, would not achieve the same generalization goal. For example, using datasets all at once like DTH [1], MipNeRF-360 [2], or SerialGen [3] to achieve multi-view or personalized
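One reviewer above asks whether the VLM-based metrics agree with human judgment on a subset of the benchmark. A minimal sketch of such a check, assuming each image set has one automatic score and one human rating, is a Spearman rank correlation. The function names and the score lists below are hypothetical placeholders for illustration, not data or code from the paper.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical placeholder scores for five image sets (not from the paper).
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]  # automatic evaluator
human_scores = [1, 3, 2, 5, 4]             # human ratings
agreement = spearman(metric_scores, human_scores)
```

A value near 1 would indicate that the automatic metric orders image sets much as human raters do. A rank correlation is used here because the two score scales need not be comparable.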
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Historical Architecture and Urbanism
