Fake it till You Make it: Reward Modeling as Discriminative Prediction

Runtao Liu; Jiahao Zhan; Yingqing He; Chen Wei; Alan Yuille; Qifeng Chen

arXiv:2506.13846·cs.CV·June 27, 2025

Open Access·3 Reviews

TL;DR

This paper introduces GAN-RM, a novel reward modeling framework that uses adversarial discrimination between target and generated samples, reducing reliance on manual annotations and engineering, and improving reinforcement learning for visual generative models.

Contribution

GAN-RM eliminates manual preference annotation and explicit quality dimension engineering by training a reward model through adversarial discrimination with minimal target data.

Findings

01 Effective for test-time sample filtering, where the reward model selects the higher-quality outputs

02 Improves post-training fine-tuning methods such as SFT and DPO

03 Requires only a few hundred target samples for training
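The first finding, test-time sample filtering, can be illustrated with a minimal sketch. It assumes a best-of-N style selection in which every candidate is scored by the reward model and only the top `k` are kept. The names `filter_top_k` and `reward_fn` are hypothetical, and the toy reward below simply reads a precomputed score rather than running a trained model.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def filter_top_k(samples: Sequence[T], reward_fn: Callable[[T], float], k: int) -> List[T]:
    """Keep the k candidates with the highest reward score."""
    if k <= 0:
        return []
    # Sort by reward, highest first, then truncate to the k best candidates.
    ranked = sorted(samples, key=reward_fn, reverse=True)
    return ranked[:k]


# Toy stand-in: each "sample" is a (name, score) pair and the hypothetical
# reward function just returns the stored score.
candidates = [("a", 0.12), ("b", 0.91), ("c", 0.47), ("d", 0.78)]
best = filter_top_k(candidates, reward_fn=lambda s: s[1], k=2)
print(best)
```

In practice `reward_fn` would wrap the trained reward model and `samples` would be the images or videos generated for a single prompt.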

Abstract

An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate our…
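The discriminative idea in the abstract can be sketched as a binary classifier that labels target samples 1 and generated samples 0, with the predicted probability used as the reward. The sketch below is an assumption-laden toy, not the paper's implementation: it uses a linear head trained by plain gradient descent on made-up 2-D feature vectors, and the names `train_reward_head` and `reward` are hypothetical. The reviews indicate the actual model puts a small MLP on frozen CLIP-Vision features.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_reward_head(
    target_feats: Sequence[Sequence[float]],
    generated_feats: Sequence[Sequence[float]],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit w, b so sigmoid(w.x + b) is high for target and low for generated features."""
    data = [(x, 1.0) for x in target_feats] + [(x, 0.0) for x in generated_feats]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of binary cross-entropy w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def reward(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Reward = predicted probability that x looks like a target sample."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy features (assumption): target samples cluster near (+1, +1),
# generated samples near (-1, -1).
rng = random.Random(0)
target_feats = [[rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)] for _ in range(50)]
generated_feats = [[rng.gauss(-1.0, 0.5), rng.gauss(-1.0, 0.5)] for _ in range(50)]
w, b = train_reward_head(target_feats, generated_feats)
print(reward([1.0, 1.0], w, b), reward([-1.0, -1.0], w, b))
```

Note that this is ordinary supervised classification; there is no generator update inside the loop.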

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

1. Reward modeling as discriminative prediction, using Preference Proxy Data (PPD) instead of labeled preferences.
2. Rank-based bootstrapping to expand pseudo labels and support multi-round post-training.
3. Demonstrations across image quality, safety alignment, and video generation, with performance competitive with DiffusionDPO trained on ~1M labels while using only ~0.5k target samples.
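One way to picture the rank-based bootstrapping mentioned in the strengths above is the following sketch. It assumes that generated samples are ranked by reward score, that the highest-ranked become pseudo-positives and the lowest-ranked pseudo-negatives, and that the two groups are paired up as (chosen, rejected) examples for preference-based fine-tuning. The function names, the `top_k` and `bottom_k` parameters, and the pairing rule are illustrative assumptions; the paper's exact procedure is not given in this summary.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rank_pseudo_labels(
    samples: Sequence[T],
    reward_fn: Callable[[T], float],
    top_k: int,
    bottom_k: int,
) -> Tuple[List[T], List[T]]:
    """Split samples into pseudo-positives (best top_k) and pseudo-negatives (worst bottom_k)."""
    if top_k < 0 or bottom_k < 0 or top_k + bottom_k > len(samples):
        raise ValueError("top_k and bottom_k must be non-negative and not overlap")
    ranked = sorted(samples, key=reward_fn, reverse=True)
    positives = ranked[:top_k]
    negatives = ranked[len(ranked) - bottom_k:]
    return positives, negatives


def make_preference_pairs(positives: Sequence[T], negatives: Sequence[T]) -> List[Tuple[T, T]]:
    """Pair each pseudo-positive with a pseudo-negative as (chosen, rejected)."""
    return list(zip(positives, negatives))


# Toy reward scores (assumption): in practice these come from the trained reward model.
scores = {"img0": 0.2, "img1": 0.9, "img2": 0.5, "img3": 0.1, "img4": 0.7}
pos, neg = rank_pseudo_labels(list(scores), reward_fn=scores.get, top_k=2, bottom_k=2)
pairs = make_preference_pairs(pos, neg)
print(pairs)
```

Repeating this after each fine-tuning round, with fresh samples from the updated generator, is one plausible reading of "multi-round post-training".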

Weaknesses

1. GAN-RM uses only 0.5k samples to achieve performance comparable to methods that use 1M samples. What happens when GAN-RM itself is trained on 1M samples? Table 11 is not enough to show whether GAN-RM scales.
2. Potential domain bias: the discriminator could learn style or domain artifacts of the PPD rather than human "preference" per se, with a risk of reward hacking toward the PPD distribution.

Reviewer 02·Rating 4·Confidence 5

Strengths

1. The discriminator is built on an open-source, pretrained model (CLIP-Vision) plus a small MLP called RPL, making it accessible to all practitioners; the results are easily reproducible.
2. The method is indeed data-efficient, using a very small, unpaired dataset to train the discriminator.
3. Strong empirical validation: extensive experiments show that this simple method can be roughly equivalent to much more expensive methods such as DiffusionDPO.

Weaknesses

Major concerns: 1. The name GAN-RM is conceptually misleading: there is no adversarial training between generator and discriminator, nor any min–max optimization. The method consists of supervised discriminative training of a reward model (binary classifier) followed by generator fine-tuning using pseudo-preference data. It would be clearer to present this as Reward Modeling via Discriminative Prediction rather than a GAN variant. The current terminology may confuse readers and overstate the co

Reviewer 03·Rating 4·Confidence 3

Strengths

1. The method is computationally efficient because it freezes the CLIP encoder and trains only a small MLP head. This makes training fast and practical for real-world deployment.
2. Broad experimental validation across multiple base models (SD1.5, SDXL, VideoCrafter2), multiple domains (image quality, safety, video), and multiple post-training methods. The method demonstrates consistent improvements across diverse settings.
3. Comprehensive ablation studies are presented in the

Weaknesses

1. The core claim that the method "eliminates manual preference annotation" is misleading. The method still depends on a pre-curated Preference Proxy Data (PPD) set of 500 high-quality JourneyDB images, i.e. someone has already judged quality. Comparing 500 curated samples against 1M crowd-sourced labels is also not a fair comparison.
2. The paper has limited technical novelty. "GAN" is mostly branding. The reward model is a CLIP-vision feature extractor + small MLP trained as a binary classifier to separate PPD images from mod

Taxonomy

Topics: Customer churn and segmentation

Methods: Sparse Evolutionary Training