RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
Liyao Jiang, Ruichen Chen, Chao Gao, Di Niu

TL;DR
RAISE introduces a training-free, adaptive evolutionary framework that improves text-to-image alignment by dynamically refining images at inference time based on requirement satisfaction, reducing computation and enhancing fidelity.
Contribution
It presents a novel requirement-driven evolutionary method for inference-time image refinement that adapts to prompt complexity without additional training or fine-tuning.
Findings
Achieves state-of-the-art alignment scores on GenEval with fewer samples.
Reduces generated samples by 30-40% and VLM calls by 80% compared to prior methods.
Demonstrates effective, model-agnostic self-improvement across benchmarks.
Abstract
Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse…
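The abstract is truncated here, so the exact RAISE procedure is not available on this page. As a rough illustration of the general idea it describes (a candidate population refined at inference time, with the budget adapting to how many requirements remain unmet), the following is a minimal toy sketch. Everything in it is an assumption made for illustration: the function names `adaptive_evolve`, `toy_generate`, and `toy_refine`, the representation of an "image" as the set of requirements it satisfies, and the scoring rule, which stands in for a VLM-based check. It is not the authors' algorithm.

```python
from typing import Callable, FrozenSet, List, Tuple

# Toy stand-in for an image: the set of prompt requirements it satisfies.
Candidate = FrozenSet[str]


def adaptive_evolve(
    requirements: FrozenSet[str],
    generate: Callable[[], Candidate],
    refine: Callable[[Candidate, FrozenSet[str]], Candidate],
    population_size: int = 4,
    max_rounds: int = 10,
) -> Tuple[Candidate, int, int]:
    """Evolve candidates until all requirements are met or the budget runs out.

    Returns (best candidate, refinement rounds used, samples generated).
    """

    def score(candidate: Candidate) -> int:
        # Hypothetical verifier: counts satisfied requirements.
        return len(requirements & candidate)

    population: List[Candidate] = [generate() for _ in range(population_size)]
    samples = population_size
    rounds = 0
    while True:
        population.sort(key=score, reverse=True)
        best = population[0]
        # Adaptive stopping: easy prompts exit early, hard ones use more rounds.
        if requirements <= best or rounds == max_rounds:
            return best, rounds, samples
        # Keep the top half, then refine each survivor toward its unmet requirements.
        parents = population[: max(1, population_size // 2)]
        n_children = population_size - len(parents)
        children = []
        for i in range(n_children):
            parent = parents[i % len(parents)]
            children.append(refine(parent, requirements - parent))
        samples += len(children)
        population = parents + children
        rounds += 1


def toy_generate() -> Candidate:
    # Assumed worst case: the initial sample satisfies nothing.
    return frozenset()


def toy_refine(candidate: Candidate, unmet: FrozenSet[str]) -> Candidate:
    # Assumed refinement: fixes exactly one unmet requirement per call.
    return candidate | {min(unmet)} if unmet else candidate


easy = frozenset({"a red cube"})
hard = frozenset({"a red cube", "a blue sphere", "cube left of sphere"})
print(adaptive_evolve(easy, toy_generate, toy_refine)[1:])  # (1, 6)
print(adaptive_evolve(hard, toy_generate, toy_refine)[1:])  # (3, 10)
```

In this toy setting the one-requirement prompt stops after one round and six samples, while the three-requirement prompt uses three rounds and ten samples. That is the contrast with a fixed iteration budget, which would spend the same number of samples on both.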
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
