PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

Jinlong Liu; Wanggui He; Peng Zhang; Mushui Liu; Hao Jiang; Pipei Huang

arXiv:2604.12652·cs.CV·April 23, 2026

TL;DR

PromptEcho is an annotation-free, efficient reward for text-to-image reinforcement learning: a frozen vision-language model scores image-text alignment, improving prompt following without human preference annotation or reward model training.

Contribution

The paper introduces PromptEcho, a reward construction method that requires no annotation and no reward model training; it uses a frozen, pre-trained VLM to supply the reward signal for RL.

Findings

01 PromptEcho significantly outperforms inference-based scoring methods.

02 Reward quality improves with larger VLMs.

03 PromptEcho achieves substantial gains on DenseAlignBench and other benchmarks.

Abstract

Reinforcement learning (RL) can improve the prompt-following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense…
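The reward computation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `token_cross_entropy` and `prompt_echo_reward` are hypothetical, the `logits` array stands in for the output of a frozen VLM conditioned on the generated image and the guiding query, and turning the loss into a reward by negating its mean over prompt tokens is an assumption (the abstract only states that the token-level cross-entropy is computed with the original prompt as the label).

```python
import numpy as np


def token_cross_entropy(logits, labels):
    """Per-token cross-entropy.

    logits: array of shape (T, V), one row of vocabulary scores per prompt token.
            In practice these would come from a frozen VLM (stand-in here).
    labels: array of shape (T,), token ids of the original prompt.
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each label token.
    return -log_probs[np.arange(labels.shape[0]), labels]


def prompt_echo_reward(logits, labels):
    """Assumed reward: negative mean token-level cross-entropy.

    A lower loss (the VLM finds the prompt more likely given the image)
    yields a higher reward. The exact normalization is an assumption.
    """
    return float(-token_cross_entropy(logits, labels).mean())


if __name__ == "__main__":
    prompt_ids = [0, 1, 2]  # toy prompt of three tokens, vocabulary of four
    aligned = np.full((3, 4), -5.0)
    aligned[[0, 1, 2], prompt_ids] = 5.0  # VLM strongly predicts the prompt
    misaligned = np.full((3, 4), -5.0)
    misaligned[[0, 1, 2], [3, 3, 3]] = 5.0  # VLM predicts other tokens
    print(prompt_echo_reward(aligned, prompt_ids))
    print(prompt_echo_reward(misaligned, prompt_ids))
```

Because the VLM is frozen and no sampling is involved, the same image and prompt always produce the same value, which matches the abstract's description of the reward as deterministic.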
