# Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

**Authors:** Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

arXiv: 2508.20751 · 2026-04-21

## TL;DR

Pref-GRPO is a GRPO-based reinforcement learning method for text-to-image (T2I) generation that replaces pointwise reward scores with pairwise preference win rates, which stabilizes training and mitigates reward hacking. The paper also introduces UniGenBench, a unified benchmark for fine-grained T2I evaluation.

## Contribution

The paper proposes Pref-GRPO, a novel pairwise preference reward approach for stable T2I training, and introduces UniGenBench, a detailed benchmark for evaluating T2I models.

## Key findings

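- Pref-GRPO distinguishes subtle differences in image quality that pointwise scores only separate by minimal margins.
- Using the within-group win rate as the reward yields more stable advantages and mitigates reward hacking.
- UniGenBench comprises 600 prompts across 5 main themes and 20 subthemes, and evaluates semantic consistency through 10 primary and 27 sub-criteria.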

## Abstract

Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RMs) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging an MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open-source and closed-source T2I models and validate the effectiveness of Pref-GRPO.
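The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `eps` stabilizer, the toy scores, and the `toy_prefer` comparator (a stand-in for a learned pairwise preference RM) are all assumptions made for illustration. The first function shows how group normalization can turn near-identical pointwise scores into large advantages; the second shows one plausible way to compute a within-group win rate.

```python
import numpy as np


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (mean 0, unit std).

    `eps` is an assumed numerical stabilizer, not a value from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def win_rate_rewards(images, prefer):
    """Reward each image by the fraction of pairwise comparisons it wins.

    `prefer(a, b)` is a hypothetical callable that returns True when `a`
    is preferred over `b`; a real system would query a preference RM here.
    """
    n = len(images)
    wins = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j and prefer(images[i], images[j]):
                wins[i] += 1
    return wins / (n - 1)


# Toy pointwise scores that differ only in the third decimal place.
tiny_gap_scores = [0.800, 0.801, 0.802, 0.803]
# Scores with clearly separated quality, for comparison.
large_gap_scores = [1.0, 2.0, 3.0, 4.0]

# After normalization both groups yield essentially the same advantages,
# so the trivial 0.001 gaps are treated like large quality differences.
tiny_adv = group_normalized_advantages(tiny_gap_scores)
large_adv = group_normalized_advantages(large_gap_scores)


def toy_prefer(a, b):
    """Stand-in comparator: treats each 'image' as a scalar quality value."""
    return a > b


win_rates = win_rate_rewards(tiny_gap_scores, toy_prefer)
```

In this sketch `tiny_adv` and `large_adv` are numerically almost identical, which is the "illusory advantage" effect the abstract attributes to normalizing minimal score differences. The win-rate reward depends only on the outcome of each comparison rather than on the magnitude of a score. How the win rate is subsequently turned into an advantage is not specified in this summary, so that step is left out.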

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20751/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20751/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/2508.20751/full.md

---
Source: https://tomesphere.com/paper/2508.20751