UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking
Lingling Fu, Yongfu Xue

TL;DR
This paper introduces UMM-RM, a novel reward model that mitigates reward hacking in reinforcement learning from human feedback by using a mixture-of-experts approach with shared experts, improving robustness and stability.
Contribution
The paper proposes UMM-RM, a mixture-of-experts reward model that enhances robustness against reward hacking and maintains inference efficiency by consolidating experts after training.
Findings
UMM-RM improves preference accuracy over dense RMs.
UMM-RM reduces reward hacking during PPO training.
UMM-RM achieves more stable preference alignment.
Abstract
Reward models (RMs) are a critical component of reinforcement learning from human feedback (RLHF). However, conventional dense RMs are susceptible to exploitation by policy models through biases or spurious correlations, resulting in reward hacking: RM scores increase during training while alignment with human preferences deteriorates, a problem that is further exacerbated under distribution shift. To address this issue, we propose UMM-RM (Upcycle-and-Merge MoE Reward Model). UMM-RM first upscales the feed-forward layers of a dense backbone into a mixture-of-experts (MoE) reward model with shared experts. The shared experts are always activated to capture instruction-agnostic preference signals, while the remaining experts model fine-grained preferences across instructions or task regimes. After training, the experts are consolidated into a single dense RM via learnable merging…
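The upcycle-then-merge pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the router, the number of experts, or the form of the learnable merge, so everything below is an assumption made for illustration. In particular, the names `upcycle`, `moe_forward`, and `merge` are hypothetical, experts are initialized as plain copies of the dense feed-forward block, routing is top-k over a softmax gate, and the "learnable merging" is stood in for by a softmax-weighted average of expert parameters.

```python
import copy
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def ffn(params, x):
    """Two-layer ReLU feed-forward block: W2 @ relu(W1 @ x)."""
    W1, W2 = params
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)


def upcycle(dense, n_experts):
    """Assumed upcycling: every expert starts as a copy of the dense FFN.
    By convention here, index 0 is the shared expert."""
    return [copy.deepcopy(dense) for _ in range(n_experts)]


def moe_forward(experts, router, x, top_k=1):
    """Shared expert (index 0) is always active; the top_k routed experts
    are added with gate weights renormalized over the selected set."""
    out = ffn(experts[0], x)
    gates = softmax(matvec(router, x))  # one gate per routed expert
    top = sorted(range(len(gates)), key=lambda i: -gates[i])[:top_k]
    norm = sum(gates[i] for i in top)
    for i in top:
        y = ffn(experts[i + 1], x)
        out = [o + (gates[i] / norm) * yi for o, yi in zip(out, y)]
    return out


def merge(experts, logits):
    """Assumed merge rule: average expert parameters with coefficients
    softmax(logits). In training, `logits` would be the learnable part."""
    coeffs = softmax(logits)
    merged = []
    for mats in zip(*experts):  # same-position matrices across experts
        rows, cols = len(mats[0]), len(mats[0][0])
        merged.append([
            [sum(c * m[r][k] for c, m in zip(coeffs, mats)) for k in range(cols)]
            for r in range(rows)
        ])
    return tuple(merged)


# Toy usage: a 2x2 dense FFN upcycled into 1 shared + 2 routed experts.
dense = ([[1.0, -2.0], [0.5, 0.25]], [[1.0, 0.0], [0.0, 1.0]])
experts = upcycle(dense, 3)
router = [[0.1, 0.2], [0.3, -0.1]]  # hypothetical router weights
x = [1.0, 2.0]
moe_out = moe_forward(experts, router, x, top_k=1)
merged = merge(experts, [0.3, -1.0, 2.0])
dense_out = ffn(merged, x)
```

Because the merge coefficients sum to one, merging experts that are still identical copies recovers the original dense weights; the merged model diverges from the backbone only insofar as training has moved the experts apart. The merged result is a single dense block, which is consistent with the abstract's point that the final RM is dense.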
Taxonomy
Topics: Reinforcement Learning in Robotics · Emotion and Mood Recognition · Explainable Artificial Intelligence (XAI)
