Reward Sharpness-Aware Fine-Tuning for Diffusion Models

Kwanyoung Kim; Byeongsu Sim

arXiv:2603.21175 · cs.LG · March 24, 2026

Open Access

TL;DR

This paper introduces RSA-FT (Reward Sharpness-Aware Fine-Tuning), which fine-tunes diffusion models with gradients from a robustified reward model, without retraining that reward model, to reduce reward hacking and improve alignment with human preferences.

Contribution

We propose a novel approach that exploits gradients from a robustified reward model to mitigate reward hacking in diffusion models without retraining the reward model.

Findings

01. Each method reduces reward hacking independently.

02. Joint use of methods amplifies robustness and alignment.

03. RSA-FT consistently improves reliability of RDRL.

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model…
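
The abstract is cut off before it gives the exact procedure, so the following is only a minimal sketch of the general idea of taking reward gradients at perturbed parameters. It is a generic sharpness-aware (SAM-style) ascent step on a toy problem, not the authors' algorithm. It assumes the parameters are a plain list of floats and that the reward gradient is available in closed form. The names `perturbed_reward_grad`, `fine_tune`, `quadratic_reward_grad`, and the radius `rho` are hypothetical and introduced only for illustration.

```python
import math

def perturbed_reward_grad(theta, reward_grad, rho=0.05):
    """Reward gradient evaluated at worst-case perturbed parameters.

    Assumption: a SAM-style perturbation of length rho along the direction
    that locally decreases the reward the most (the negative gradient).
    """
    g = reward_grad(theta)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return g  # stationary point: no direction to perturb along
    shifted = [t - rho * gi / norm for t, gi in zip(theta, g)]
    return reward_grad(shifted)

def fine_tune(theta, reward_grad, lr=0.1, rho=0.05, steps=100):
    """Gradient ascent on the reward using the perturbed gradient."""
    theta = list(theta)
    for _ in range(steps):
        g = perturbed_reward_grad(theta, reward_grad, rho)
        theta = [t + lr * gi for t, gi in zip(theta, g)]
    return theta

def quadratic_reward_grad(theta, curvature=2.0):
    """Gradient of the toy reward r(theta) = -0.5 * curvature * ||theta||^2."""
    return [-curvature * t for t in theta]

if __name__ == "__main__":
    print(perturbed_reward_grad([3.0, 4.0], quadratic_reward_grad, rho=0.5))
    print(fine_tune([3.0, 4.0], quadratic_reward_grad))
```

On this toy reward the perturbed gradient is larger in magnitude than the plain gradient, because the perturbation moves the parameters further from the maximum before the gradient is taken. In a real setting, `theta` would stand in for the diffusion model's parameters and `reward_grad` would be computed by backpropagating a learned reward through generated images.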

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics