Policy Optimized Text-to-Image Pipeline Design
Uri Gadot, Rinon Gal, Yftah Ziser, Gal Chechik, Shie Mannor

TL;DR
This paper presents a reinforcement learning framework for designing text-to-image pipelines that improves image quality and diversity while reducing computational costs, surpassing existing automated methods.
Contribution
It introduces a reward model ensemble and a two-phase training strategy with GRPO optimization for efficient pipeline design in text-to-image generation.
Findings
Achieves higher image quality than baseline methods.
Creates more diverse pipeline workflows.
Reduces computational costs compared to traditional approaches.
Abstract
Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image…
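The two ingredients named above, an ensemble of reward models that scores prompt-workflow pairs and GRPO optimization, can be illustrated with a minimal sketch. It is not the authors' implementation. Averaging the ensemble members is an assumed aggregation rule, the `RewardModel` signature and the toy scorers are hypothetical stand-ins, and only the standard group-relative advantage step of GRPO is shown (each sampled workflow's reward is standardized against the other samples for the same prompt).

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Hypothetical interface: (prompt, workflow description) -> predicted quality score.
RewardModel = Callable[[str, str], float]


def ensemble_reward(models: Sequence[RewardModel], prompt: str, workflow: str) -> float:
    """Average the ensemble members' predictions (assumed aggregation rule)."""
    return mean(m(prompt, workflow) for m in models)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy scorers standing in for trained reward models (illustrative only).
    toy_models: List[RewardModel] = [
        lambda prompt, workflow: float(len(workflow.split("->"))),
        lambda prompt, workflow: float(len(workflow.split("->"))) + 0.5,
    ]
    prompt = "a watercolor fox"
    # A group of candidate workflows sampled for the same prompt.
    group = ["base", "base->upscale", "base->adapter->upscale"]
    rewards = [ensemble_reward(toy_models, prompt, w) for w in group]
    for workflow, adv in zip(group, group_relative_advantages(rewards)):
        print(f"{workflow}: advantage {adv:+.3f}")
```

Because the rewards come from predictors rather than from rendered images, a group of candidate workflows can be scored without running any of them, which is the cost saving the abstract points to.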
Taxonomy
Topics: ICT Impact and Policies · Multimedia Communication and Technology
