Test-time Prompt Refinement for Text-to-Image Models

Mohammad Abdul Hafeez Khan; Yash Jain; Siddhartha Bhattacharyya; Vibhav Vineet

arXiv:2507.22076 · cs.LG · July 31, 2025

TL;DR

This paper introduces TIR, a test-time prompt refinement framework for text-to-image (T2I) generation. After each generation step, a pretrained multimodal large language model (MLLM) compares the output image with the user's prompt and rewrites the prompt for the next round, so output quality improves iteratively without retraining the underlying T2I model.

Contribution

The paper presents a closed-loop, test-time prompt refinement method that improves T2I outputs by iteratively checking each generated image against the prompt and adjusting the prompt, with no additional training of the base model.

Findings

01. Improves alignment and visual coherence in T2I outputs
02. Works with black-box T2I models without retraining
03. Effective across multiple benchmark datasets

Abstract

Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model, termed TIR. In our approach, each generation step is followed by a refinement step, where a pretrained multimodal large language model (MLLM) analyzes the output image and the user's prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined and physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors, mirroring the iterative refinement process of human artists. We…
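The generate-then-refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `refine_at_test_time`, `Critique`, `generate`, and `critique` are hypothetical, the stopping rule and the `max_rounds` budget are assumptions, and the two `toy_*` functions are string-based stand-ins for a real T2I model and MLLM so that the control flow can be run without any model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical container for what the MLLM reports about one image."""
    aligned: bool         # True when no misalignment was detected
    issues: List[str]     # e.g. ["missing: dog"]
    refined_prompt: str   # prompt to use for the next generation round


def refine_at_test_time(
    user_prompt: str,
    generate: Callable[[str], Any],             # black-box T2I model (assumed interface)
    critique: Callable[[str, Any], Critique],   # MLLM judge (assumed interface)
    max_rounds: int = 3,                        # assumed budget, not from the paper
) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Alternate generation and refinement until the critic is satisfied."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    prompt = user_prompt
    history: List[Tuple[str, Any]] = []
    image = None
    for _ in range(max_rounds):
        image = generate(prompt)
        history.append((prompt, image))
        # The critic always sees the ORIGINAL user prompt plus the latest image.
        result = critique(user_prompt, image)
        if result.aligned:
            break
        prompt = result.refined_prompt
    return image, history


def toy_generate(prompt: str) -> str:
    """Toy 'model': drops everything after ' and ' unless told 'Both: '."""
    if prompt.startswith("Both: "):
        return prompt[len("Both: "):]
    return prompt.split(" and ")[0]


def toy_critique(user_prompt: str, image: str) -> Critique:
    """Toy 'MLLM': flags prompt words that are absent from the 'image'."""
    rendered = image.split()
    missing = [w for w in user_prompt.split() if w not in rendered]
    if not missing:
        return Critique(True, [], user_prompt)
    return Critique(False, [f"missing: {w}" for w in missing], "Both: " + user_prompt)


if __name__ == "__main__":
    final_image, rounds = refine_at_test_time("a cat and a dog", toy_generate, toy_critique)
    print(final_image, len(rounds))
```

Because the T2I model is only ever called through `generate(prompt)`, the loop treats it as a black box, which matches the no-retraining setting the abstract describes. In practice the two callables would wrap real model calls.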
