Reward Sharpness-Aware Fine-Tuning for Diffusion Models
Kwanyoung Kim, Byeongsu Sim

TL;DR
This paper introduces RSA-FT, a fine-tuning method for reward-centric diffusion reinforcement learning (RDRL) that uses gradients from a robustified reward model, reducing reward hacking and improving alignment with human preferences.
Contribution
We propose a novel approach that exploits gradients from a robustified reward model to mitigate reward hacking in diffusion models without retraining the reward model.
Findings
Each proposed method reduces reward hacking on its own.
Using the methods together further improves robustness and alignment.
RSA-FT consistently improves the reliability of reward-centric diffusion reinforcement learning (RDRL).
Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model…
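The idea of taking gradients from a "flattened" objective under parameter perturbations can be illustrated with a generic sharpness-aware update on a toy problem. The sketch below is not the authors' implementation: the quadratic reward, the function names, and the parameters `lr` and `rho` are all assumptions made for illustration, and the update follows the general pattern of sharpness-aware minimization adapted to reward maximization.

```python
import numpy as np


def reward(theta, curvature):
    """Toy concave reward R(theta) = -0.5 * sum(curvature * theta**2).

    Stands in for "reward of the sample produced by parameters theta";
    a real pipeline would run a diffusion model and a reward model here.
    """
    return -0.5 * float(np.sum(curvature * theta ** 2))


def reward_grad(theta, curvature):
    """Analytic gradient of the toy reward with respect to theta."""
    return -curvature * theta


def sharpness_aware_ascent_step(theta, curvature, lr=0.1, rho=0.05):
    """One reward-ascent step using the gradient at a perturbed point.

    1. Compute the reward gradient at theta.
    2. Move a distance rho in the direction that lowers the reward most
       (the negative normalized gradient), a first-order worst case.
    3. Evaluate the gradient at that perturbed point.
    4. Apply that gradient to the original, unperturbed theta.
    """
    g = reward_grad(theta, curvature)
    norm = np.linalg.norm(g)
    if norm == 0.0 or rho == 0.0:
        # No perturbation possible or requested: plain gradient ascent.
        return theta + lr * g
    eps = -rho * g / norm
    g_perturbed = reward_grad(theta + eps, curvature)
    return theta + lr * g_perturbed


theta0 = np.array([1.0, 0.0])
curv = np.array([2.0, 1.0])
plain = sharpness_aware_ascent_step(theta0, curv, lr=0.1, rho=0.0)
flat = sharpness_aware_ascent_step(theta0, curv, lr=0.1, rho=0.05)
print("plain ascent step:    ", plain)
print("perturbed-point step: ", flat)
```

With `rho=0.0` the step reduces to ordinary gradient ascent on the reward. A positive `rho` makes the update depend on how the reward behaves in a small neighborhood of the current parameters, which is the intuition behind preferring flat regions over sharp ones.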
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
