Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan

TL;DR
This paper introduces the Latent Reward Model (LRM) and Latent Preference Optimization (LPO), which use pre-trained diffusion models for step-level preference alignment in the noisy latent space, yielding faster training and better preference alignment.
Contribution
It proposes a latent-space reward model and a step-level preference optimization method that reuse components of the diffusion model directly, improving both efficiency and effectiveness over pixel-space approaches.
Findings
Significantly improved alignment with general, aesthetic, and text-image alignment preferences.
Achieved a 2.5-28x training speedup over existing preference optimization methods.
Effective in handling noisy images at various timesteps.
Abstract
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference…
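The abstract's core idea is that a reward model should score latent images at arbitrary noise levels. A minimal sketch, under stated assumptions and not the authors' LRM or LPO: it forms a noisy latent with the standard DDPM forward process (the linear beta schedule from 1e-4 to 0.02 over 1000 steps is an assumption; the schedule of the model the paper builds on may differ), scores it with a hypothetical `toy_latent_reward` standing in for the diffusion-based reward model, and compares a preferred and a dispreferred latent at the same timestep with a Bradley-Terry loss. Bradley-Terry is a common pairwise objective for reward models; the truncated abstract does not confirm it is the one used here.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Signal fraction prod_{s<=t} (1 - beta_s) under an assumed DDPM-style linear beta schedule."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t, rng):
    """Forward process q(x_t | x_0): x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, eps ~ N(0, I)."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


def toy_latent_reward(x_t, t, w):
    """HYPOTHETICAL stand-in for a noise-aware reward r(x_t, t); not the paper's LRM.

    Since E[x_t] = sqrt(a_t) * x_0, dividing w . x_t by sqrt(a_t) makes the score
    match the clean-latent score w . x_0 in expectation over the noise.
    """
    return sum(wi * xi for wi, xi in zip(w, x_t)) / math.sqrt(alpha_bar(t))


def bt_loss(r_win, r_lose):
    """Pairwise Bradley-Terry loss -log(sigmoid(r_win - r_lose)), computed stably."""
    m = r_win - r_lose
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


# Usage: compare a preferred and a dispreferred latent at the same timestep t.
rng = random.Random(0)
w = [1.0, -0.5, 0.25]                                # hypothetical reward weights
x0_win, x0_lose = [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]   # toy "clean" latents
t = 500
xt_win, xt_lose = add_noise(x0_win, t, rng), add_noise(x0_lose, t, rng)
loss = bt_loss(toy_latent_reward(xt_win, t, w), toy_latent_reward(xt_lose, t, w))
print(f"step-level preference loss at t={t}: {loss:.4f}")
```

The toy reward becomes noisier as t grows because less signal survives in x_t; that difficulty is what motivates a reward model explicitly designed for noisy latents. How LRM is actually built from diffusion-model components and used inside LPO is described in the full paper, not in this sketch.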
