Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Chengzhuo Tong; Ziyu Guo; Renrui Zhang; Wenyu Shan; Xinyu Wei; Zhenghao Xing; Hongsheng Li; Pheng-Ann Heng

arXiv:2505.17017 · cs.CV · June 11, 2025

TL;DR

This paper investigates the application of reinforcement learning algorithms DPO and GRPO to autoregressive image generation, analyzing their performance, generalization, and the impact of reward models to improve image quality and consistency.

Contribution

It provides the first comprehensive analysis of DPO and GRPO in autoregressive image generation, highlighting their advantages and the role of reward models in enhancing generalization.

Findings

01. GRPO and DPO have distinct strengths in image generation tasks.

02. Reward models with better generalization improve RL algorithm performance.

03. Scaling strategies significantly boost in-domain and out-of-domain capabilities.

Abstract

Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To…
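To make the contrast between the two algorithms named in the abstract concrete, here is a minimal sketch of their core quantities in their standard textbook forms: the DPO loss on one preference pair and the GRPO group-relative advantage. It is an illustration only, not the authors' implementation. The function names, the `beta` default, the `eps` term, and the use of the population standard deviation are all assumptions made for this example.

```python
import math
from statistics import mean, pstdev


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under a frozen
    reference model. `beta` is a hypothetical default, not a value from the paper.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each reward within its sampled group.

    `rewards` holds reward-model scores for several images generated from the
    same prompt. Population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The sketch shows the structural difference: DPO consumes a fixed preference pair and needs no reward model at training time, while GRPO scores a group of fresh samples with a reward model and ranks each sample against its own group.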

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning

Methods: Direct Preference Optimization