Fake it till You Make it: Reward Modeling as Discriminative Prediction
Runtao Liu, Jiahao Zhan, Yingqing He, Chen Wei, Alan Yuille, Qifeng Chen

TL;DR
This paper introduces GAN-RM, a novel reward modeling framework that uses adversarial discrimination between target and generated samples, reducing reliance on manual annotations and engineering, and improving reinforcement learning for visual generative models.
Contribution
GAN-RM eliminates manual preference annotation and explicit quality dimension engineering by training a reward model through adversarial discrimination with minimal target data.
Findings
Effective in test-time sample filtering for quality selection
Improves post-training methods such as SFT and DPO
Requires only a few hundred target samples for training
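The test-time filtering finding amounts to best-of-N selection: generate several candidates, score each with the reward model, and keep the highest-scoring ones. The sketch below is a minimal illustration under assumptions. The function name `best_of_n`, the sample identifiers, and the toy scores are hypothetical stand-ins and are not taken from the paper.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T],
              score_fn: Callable[[T], float],
              keep: int = 1) -> List[T]:
    """Return the `keep` candidates with the highest reward score."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Sort from highest to lowest score and truncate.
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]


# Hypothetical scores standing in for reward-model outputs on four samples.
toy_scores = {"sample_a": 0.12, "sample_b": 0.87, "sample_c": 0.45, "sample_d": 0.66}

selected = best_of_n(list(toy_scores), lambda s: toy_scores[s], keep=2)
print(selected)  # ['sample_b', 'sample_d']
```

In practice `score_fn` would wrap a trained reward model, and the candidates would be images or videos generated for the same prompt.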
Abstract
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate our…
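The core idea in the abstract, training a reward model to discriminate target samples from generated ones, can be sketched as a binary classifier over fixed feature vectors. The example below is a simplified illustration and not the authors' implementation. It assumes the features come from a frozen encoder, replaces them with synthetic 2-D points, and uses a plain logistic-regression head trained with binary cross-entropy. All names (`train_reward_head`, `reward`) are hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_reward_head(target_feats, generated_feats, epochs=200, lr=0.5):
    """Fit a logistic-regression head by full-batch gradient descent.

    Label 1 marks target (preference proxy) features, label 0 marks
    features of ordinary generated samples.
    """
    data = [(x, 1.0) for x in target_feats] + [(x, 0.0) for x in generated_feats]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of binary cross-entropy w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def reward(x, w, b) -> float:
    """Reward = predicted probability that x looks like a target sample."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Synthetic stand-ins for encoder features (assumption: 2-D, well separated).
rng = random.Random(0)
target_feats = [[rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)] for _ in range(50)]
generated_feats = [[rng.gauss(-1.0, 0.5), rng.gauss(-1.0, 0.5)] for _ in range(50)]

w, b = train_reward_head(target_feats, generated_feats)
print(reward([1.0, 1.0], w, b) > reward([-1.0, -1.0], w, b))  # True
```

The classifier's output probability then serves directly as a scalar reward, which is what lets a few hundred unpaired target samples stand in for labeled preference pairs.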
Peer Reviews
Decision·Submitted to ICLR 2026
1. Reward modeling as discriminative prediction using PPD instead of labeled preferences.
2. Rank-based bootstrapping to expand pseudo labels and support multi-round post-training.
3. Demonstrations across image quality, safety alignment, and video generation, with performance competitive with DiffusionDPO trained on ~1M labels while using ~0.5k target samples.
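One plausible reading of the rank-based bootstrapping mentioned above is: score generated samples with the current reward model, treat the top-ranked ones as pseudo-positives and the bottom-ranked ones as pseudo-negatives, and pair them up as (chosen, rejected) examples for preference-based post-training. The sketch below illustrates that reading only. The function names, the cutoff parameters `top_k` and `bottom_k`, and the toy scores are assumptions, and the paper's actual procedure may differ.

```python
from typing import List, Sequence, Tuple


def bootstrap_pseudo_labels(samples: Sequence[str],
                            scores: Sequence[float],
                            top_k: int,
                            bottom_k: int) -> Tuple[List[str], List[str]]:
    """Split samples into pseudo-positives and pseudo-negatives by reward rank."""
    if top_k + bottom_k > len(samples):
        raise ValueError("top_k + bottom_k exceeds the number of samples")
    ranked = sorted(zip(samples, scores), key=lambda pair: pair[1], reverse=True)
    positives = [s for s, _ in ranked[:top_k]]
    negatives = [s for s, _ in ranked[len(ranked) - bottom_k:]]
    return positives, negatives


def make_preference_pairs(positives: Sequence[str],
                          negatives: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each pseudo-positive with a pseudo-negative as (chosen, rejected)."""
    return list(zip(positives, negatives))


# Hypothetical sample identifiers and reward scores.
samples = ["a", "b", "c", "d", "e"]
scores = [0.9, 0.1, 0.5, 0.7, 0.3]

positives, negatives = bootstrap_pseudo_labels(samples, scores, top_k=2, bottom_k=2)
pairs = make_preference_pairs(positives, negatives)
print(positives, negatives)  # ['a', 'd'] ['e', 'b']
print(pairs)                 # [('a', 'e'), ('d', 'b')]
```

Repeating this after each round of generator fine-tuning would give the multi-round setup the review describes: new samples are generated, re-scored, and re-split.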
1. GAN-RM achieves performance comparable to methods trained on 1M samples while using only 0.5k. What happens if GAN-RM itself is trained on 1M samples? Table 11 is not enough to show whether GAN-RM scales.
2. Potential domain bias: the discriminator could learn style/domain artifacts of PPD rather than human “preference” per se, risking reward hacking toward the PPD distribution.
1. The discriminator is built on an open-source pretrained model (CLIP-Vision) with a small MLP called RPL, making it accessible to all practitioners; the results are easily reproducible.
2. The method is data-efficient, using a very small, unpaired dataset to train the discriminator.
3. Strong empirical validation: extensive experiments show that this simple method can be roughly on par with much more expensive methods such as DiffusionDPO.
Major concerns: 1. The name GAN-RM is conceptually misleading: there is no adversarial training between generator and discriminator, nor any min–max optimization. The method consists of supervised discriminative training of a reward model (binary classifier) followed by generator fine-tuning using pseudo-preference data. It would be clearer to present this as Reward Modeling via Discriminative Prediction rather than a GAN variant. The current terminology may confuse readers and overstate the co
1. The method is computationally efficient: it freezes the CLIP encoder and trains only a small MLP head, which makes training fast and practical for real-world deployment.
2. Broad experimental validation across multiple base models (SD1.5, SDXL, VideoCrafter2), multiple domains (image quality, safety, video), and multiple post-training methods. The method demonstrates consistent improvements across diverse settings.
3. Comprehensive ablation studies are presented in the
1. The core claim of "eliminates manual preference annotation" is misleading. The method still depends on a pre-curated Preference Proxy Data (PPD) set of 500 high-quality JourneyDB images, i.e. someone already judged quality. Comparing 500 curated samples against 1M crowd-sourced labels is also not a fair comparison.
2. The paper has limited technical novelty. "GAN" is mostly branding. The reward model is a CLIP-vision feature extractor + small MLP trained as a binary classifier to separate PPD images from mod
Taxonomy
Topics: Customer churn and segmentation
Methods: Sparse Evolutionary Training
