Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation

Hongji Yang; Yucheng Zhou; Wencheng Han; Jianbing Shen

arXiv:2505.16763 · cs.CV · December 16, 2025

TL;DR

This paper introduces a novel prompt optimization framework for text-to-image models that leverages large vision-language models (LVLMs) for rewriting prompts and scoring image quality, reducing reliance on manual data and human feedback.

Contribution

The proposed method uses LVLMs as both prompt rewriters and reward models in a unified reinforcement learning framework, enabling self-improvement without extensive labeled data.
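
As a rough illustration of the rewrite-then-score loop described above, the following minimal sketch ranks candidate rewrites by a combined reward. It is not the paper's implementation. The names `score_candidates`, `toy_rewrite`, `toy_generate`, and `toy_judge` are hypothetical, the toy functions are deterministic stand-ins for the LVLM rewriter, the text-to-image model, and the LVLM reward model, and the equal weighting of the two scores is an assumption. The reinforcement learning update that would consume these rewards is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Scored:
    """One candidate rewrite together with its two (hypothetical) scores."""
    prompt: str
    aesthetics: float
    alignment: float

    @property
    def reward(self) -> float:
        # Assumption: equal weighting; the source does not state how scores are combined.
        return 0.5 * self.aesthetics + 0.5 * self.alignment


def score_candidates(
    user_prompt: str,
    rewrite: Callable[[str, int], str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], Tuple[float, float]],
    n: int = 4,
) -> List[Scored]:
    """Rewrite the prompt n times, generate an image for each, score it, rank by reward."""
    results = []
    for seed in range(n):
        candidate = rewrite(user_prompt, seed)            # rewriter role
        image = generate(candidate)                       # text-to-image model
        aesthetics, alignment = judge(user_prompt, image) # reward-model role
        results.append(Scored(candidate, aesthetics, alignment))
    return sorted(results, key=lambda s: s.reward, reverse=True)


# Toy stand-ins so the sketch runs without any model (all hypothetical).
STYLE_WORDS = ["highly detailed", "soft lighting", "sharp focus"]


def toy_rewrite(user_prompt: str, seed: int) -> str:
    # Append between zero and three style phrases, depending on the seed.
    extras = STYLE_WORDS[: seed % (len(STYLE_WORDS) + 1)]
    return ", ".join([user_prompt] + extras)


def toy_generate(prompt: str) -> str:
    # The "image" is just the prompt text, so the judge has something to inspect.
    return prompt


def toy_judge(user_prompt: str, image: str) -> Tuple[float, float]:
    # Aesthetics: fraction of style phrases present; alignment: user prompt preserved.
    aesthetics = sum(w in image for w in STYLE_WORDS) / len(STYLE_WORDS)
    alignment = 1.0 if user_prompt in image else 0.0
    return aesthetics, alignment


if __name__ == "__main__":
    ranked = score_candidates("a cat on a sofa", toy_rewrite, toy_generate, toy_judge)
    for item in ranked:
        print(f"{item.reward:.3f}  {item.prompt}")
```

In a real setting the three callables would wrap model calls, and the ranked rewards could serve as the training signal for the rewriter. How the paper actually uses them is not specified in the text above.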

Findings

01. Outperforms existing prompt optimization methods on benchmark datasets

02. Reduces dependence on manual annotations and trained aesthetic models

03. Demonstrates effective self-improvement through reinforcement learning

Abstract

Text-to-image models can produce high-quality images from text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework designed to rephrase a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ large vision-language models (LVLMs) as the solver to rewrite the user prompt and, concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Aesthetic Perception and Analysis · Visual Attention and Saliency Detection