Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation
Hongji Yang, Yucheng Zhou, Wencheng Han, Jianbing Shen

TL;DR
This paper introduces a novel prompt optimization framework for text-to-image models that leverages large vision-language models (LVLMs) for rewriting prompts and scoring image quality, reducing reliance on manual data and human feedback.
Contribution
The proposed method uses LVLMs as both prompt rewriters and reward models in a unified reinforcement learning framework, enabling self-improvement without extensive labeled data.
Findings
Outperforms existing prompt optimization methods on benchmark datasets
Reduces dependence on manual annotations and trained aesthetic models
Demonstrates effective self-improvement through reinforcement learning
Abstract
Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework designed to rephrase a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ large vision-language models (LVLMs) as the solver to rewrite the user prompt and, concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated from the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM…
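The loop the abstract describes (rewrite the prompt, generate an image, score it for aesthetics and alignment) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `toy_rewrite`, `toy_generate`, and `toy_judge` are stand-ins for the LVLM rewriter, the text-to-image model, and the LVLM reward model, and the equal weighting of the two scores is an assumption. This summary does not say how the paper turns rewards into a reinforcement learning update, so the sketch only returns the best and worst candidates as one possible training signal.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Scored:
    prompt: str
    aesthetics: float
    alignment: float

    @property
    def reward(self) -> float:
        # Assumption: equal weighting; the paper's actual reward is not given here.
        return self.aesthetics + self.alignment


def self_reward_step(
    user_prompt: str,
    rewrite: Callable[[str, int], str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], Tuple[float, float]],
    n_candidates: int = 4,
) -> Tuple[Scored, Scored]:
    """Rewrite, generate, and score; return (best, worst) candidates."""
    scored = []
    for i in range(n_candidates):
        candidate = rewrite(user_prompt, i)
        image = generate(candidate)
        # The judge sees the ORIGINAL prompt so alignment is measured
        # against what the user asked for, not the rewritten text.
        aesthetics, alignment = judge(user_prompt, image)
        scored.append(Scored(candidate, aesthetics, alignment))
    scored.sort(key=lambda s: s.reward, reverse=True)
    return scored[0], scored[-1]


# Hypothetical toy stand-ins so the sketch runs without any model.
MODIFIERS = ["highly detailed", "soft lighting", "sharp focus"]


def toy_rewrite(prompt: str, i: int) -> str:
    # Append the first i modifiers to the user prompt.
    return ", ".join([prompt] + MODIFIERS[:i])


def toy_generate(prompt: str) -> str:
    # The "image" is just the prompt string.
    return prompt


def toy_judge(user_prompt: str, image: str) -> Tuple[float, float]:
    aesthetics = 0.25 * image.count(",")  # more modifiers, higher score
    alignment = 1.0 if image.startswith(user_prompt) else 0.0
    return aesthetics, alignment


best, worst = self_reward_step("a cat", toy_rewrite, toy_generate, toy_judge)
print(best.prompt, best.reward)
print(worst.prompt, worst.reward)
```

In a real setup the three callables would wrap model calls, and the (best, worst) pair could feed a preference-based update of the rewriter; that pairing is an illustrative choice, not a detail confirmed by this summary.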
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Aesthetic Perception and Analysis · Visual Attention and Saliency Detection
