Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz

TL;DR
This paper introduces PRIS, a framework that adaptively revises prompts during inference for text-to-visual generation, significantly improving alignment and quality by addressing limitations of fixed prompts during scaling.
Contribution
PRIS is the first method to dynamically revise prompts during inference, using a new verifier for fine-grained alignment, leading to substantial quality improvements.
Findings
Achieved a 15% gain on the VBench 2.0 benchmark.
Effectively identifies recurring failure patterns in generated visuals.
Enhances prompt-visual alignment through adaptive prompt redesign.
Abstract
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt…
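The generate, review, redesign, regenerate cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_generate`, `toy_verify`, `toy_redesign`, `redesign_loop`, the `REQUIREMENTS` checklist, and the `min_share` threshold for what counts as a "recurring" failure are all hypothetical stand-ins for a real generator, verifier, and prompt-rewriting model. Scoring each visual against the original user prompt rather than the revised one is likewise an assumption.

```python
from collections import Counter

# Hypothetical checklist used by the toy verifier; a real system would derive
# such checks from the user prompt with a multimodal model.
REQUIREMENTS = ["red cube", "blue sphere"]

def toy_generate(prompt, seed):
    """Toy generator: returns the set of elements it 'rendered'."""
    rendered = set()
    if "red cube" in prompt and seed % 4 != 3:   # occasional, seed-specific miss
        rendered.add("red cube")
    if "clearly visible blue sphere" in prompt:  # systematic miss unless emphasized
        rendered.add("blue sphere")
    return rendered

def toy_verify(user_prompt, visual):
    """Toy verifier: returns (score, list of requested elements that are missing)."""
    failures = [r for r in REQUIREMENTS if r in user_prompt and r not in visual]
    return 1.0 - len(failures) / len(REQUIREMENTS), failures

def toy_redesign(prompt, recurring):
    """Toy prompt redesign: emphasize every element that keeps failing."""
    return prompt + "".join(f", with a clearly visible {item}" for item in recurring)

def redesign_loop(user_prompt, generate, verify, redesign,
                  n_samples=4, n_rounds=3, min_share=0.5):
    prompt = user_prompt
    best_visual, best_score = None, float("-inf")
    for _ in range(n_rounds):
        # Scale generation: several samples (seeds) for the current prompt.
        visuals = [generate(prompt, seed) for seed in range(n_samples)]
        failure_counts = Counter()
        for visual in visuals:
            # Assumption: always verify against the original user intent.
            score, failures = verify(user_prompt, visual)
            failure_counts.update(failures)
            if score > best_score:
                best_visual, best_score = visual, score
        # A failure is "recurring" if it shows up in at least min_share of samples.
        recurring = [f for f, c in failure_counts.items() if c / n_samples >= min_share]
        if not recurring:
            break  # only isolated failures remain; more seeds suffice
        prompt = redesign(prompt, recurring)
    return best_visual, best_score, prompt

best_visual, best_score, final_prompt = redesign_loop(
    "a red cube next to a blue sphere", toy_generate, toy_verify, toy_redesign
)
print(best_score, final_prompt)
```

In this toy setup the missing blue sphere appears in every sample of the first round, so it triggers a prompt revision, while the red cube that is missing for a single seed does not. That mirrors the distinction between failures that more sampling can fix and failures that a fixed prompt keeps reproducing.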
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Strong Empirical Results: The paper shows impressive quantitative gains, such as a +15% improvement on VBench 2.0 and consistent outperformance against BoN and other baselines (Table 1, 2, 4).
- Clear Problem Formulation: The paper correctly identifies a key limitation of existing inference-time scaling methods: they scale visuals (e.g., sampling steps, seeds) but keep the prompt fixed, which leads to a quality plateau.
- Critical Lack of Novelty: As detailed under "Contribution," the paper's core idea is not new. The PRIS framework (Generate -> Verify -> Revise -> Regenerate) is functionally identical to the "reflection and guidance" or "verify and reinforce" loops proposed in prior CoT-based generation work, such as Guo et al. (2025) and Jiang et al. (2025).
- Incremental Contribution: The paper's attempt to distinguish itself by using "off-the-shelf MLLMs" instead of "unified models" is a minor implementati
The paper tackles an important and interesting problem and motivates it very well. It further gives a good overview of the state of the art. I also had the feeling that all design steps were well justified. The results are convincing and clearly show the effectiveness of the proposed method. The overall presentation of the paper is very clear and allows fluent reading. The proposed method is thoroughly evaluated w.r.t. various different perspectives (e.g., fixed compute budget, integration with
My main concern about the paper is its simplicity and the way some prompts are formulated to make the baseline methods look bad. I had the feeling that many results where the baseline methods looked bad in comparison could be easily resolved by taking the input prompt and asking an LLM to reformulate negations. Many results shown in the paper have prompts like "no laces", "fork is not wooden", "not wearing a helmet", etc. Just throwing out negations early on could probably solve m
1. The idea of using an MLLM to analyze the common failures across multiple, distinct visual generations and then revising the prompt based on the common failures is interesting.
2. The paper demonstrates strong performance, achieving significant gains.
1. This paper focuses primarily on common failures while overlooking other potential failure cases, suggesting that the final outputs may still suffer from misalignment issues.
2. Several statements and procedures lack sufficient detail or contain inaccuracies. For example:
   - The claim in Line 52 that prior methods are limited because they “operate solely in the text domain” seems unfair, as your method also operates in the text domain (i.e., by redesigning the prompt).
   - The relationship bet
1. The writing and illustrations of the paper are good and easy to follow. The task setting is clear. The appendix is rich.
2. The proposed method seems intuitive and reasonable to the reviewer.
3. This paper focuses on an interesting question in the field of visual generation.
1. The method is good overall, but the reviewer still has serious concerns about the performance gain compared with prior "prompt-refine" methods. Many existing methods have focused on test-time "prompt-refine" since "Design Guidelines for Prompt Engineering Text-to-Image Generative Models [CHI 2022]", such as:
   a. Optimizing Prompts for Text-to-Image Generation [NeurIPS 2023]
   b. From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflectio
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Data Visualization and Analytics
