UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking

Lingling Fu; Yongfu Xue

arXiv:2512.00724·cs.LG·February 3, 2026

Open Access

TL;DR

This paper introduces UMM-RM, a reward model for reinforcement learning from human feedback (RLHF) that mitigates reward hacking. It converts a dense reward model into a mixture-of-experts model with always-active shared experts, then merges the experts back into a single dense model after training.

Contribution

The paper proposes UMM-RM, a mixture-of-experts reward model that enhances robustness against reward hacking and maintains inference efficiency by consolidating experts after training.

Findings

01 UMM-RM improves preference accuracy over dense RMs.

02 UMM-RM reduces reward hacking during PPO training.

03 UMM-RM achieves more stable preference alignment.

Abstract

Reward models (RMs) are a critical component of reinforcement learning from human feedback (RLHF). However, conventional dense RMs are susceptible to exploitation by policy models through biases or spurious correlations, resulting in reward hacking: RM scores increase during training while alignment with human preferences deteriorates, a problem that is further exacerbated under distribution shift. To address this issue, we propose UMM-RM (Upcycle-and-Merge MoE Reward Model). UMM-RM first upscales the feed-forward layers of a dense backbone into a mixture-of-experts (MoE) reward model with shared experts. The shared experts are always activated to capture instruction-agnostic preference signals, while the remaining experts model fine-grained preferences across instructions or task regimes. After training, the experts are consolidated into a single dense RM via learnable merging…
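
The upcycle-then-merge flow described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract is truncated and does not specify the router, the number of experts, or the form of the learnable merging. The sketch assumes that upcycling copies the dense feed-forward weights into every expert, that routing is a softmax top-k over a linear router, and that merging is a softmax-weighted average of expert parameters. All names (`upcycle`, `moe_forward`, `merge`, `merge_logits`) are hypothetical.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(params, x):
    """Two-layer feed-forward block with ReLU: W2 @ relu(W1 @ x)."""
    W1, W2 = params
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def copy_params(params):
    return tuple([row[:] for row in W] for W in params)

def upcycle(dense, n_routed):
    """Assumed upcycling: copy the dense FFN into 1 shared + n_routed routed experts."""
    return {"shared": copy_params(dense),
            "routed": [copy_params(dense) for _ in range(n_routed)]}

def moe_forward(moe, router_W, x, top_k=1):
    """Shared expert is always active; top_k routed experts are gated by the router."""
    gates = softmax(matvec(router_W, x))
    top = sorted(range(len(gates)), key=lambda i: -gates[i])[:top_k]
    norm = sum(gates[i] for i in top)
    out = ffn(moe["shared"], x)
    for i in top:
        y = ffn(moe["routed"][i], x)
        out = [o + gates[i] / norm * v for o, v in zip(out, y)]
    return out

def merge(moe, merge_logits):
    """Assumed merging: softmax-weighted parameter average over all experts."""
    experts = [moe["shared"]] + moe["routed"]
    alpha = softmax(merge_logits)  # in training these logits would be learned
    merged = []
    for k in range(2):  # W1, then W2
        rows, cols = len(experts[0][k]), len(experts[0][k][0])
        merged.append([[sum(a * e[k][r][c] for a, e in zip(alpha, experts))
                        for c in range(cols)] for r in range(rows)])
    return tuple(merged)

# Toy usage with random weights (illustrative sizes only).
random.seed(0)
d, h, n_routed = 4, 8, 3
dense = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(h)],
         [[random.gauss(0, 1) for _ in range(h)] for _ in range(d)])
router_W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_routed)]
x = [random.gauss(0, 1) for _ in range(d)]

moe = upcycle(dense, n_routed)
merged = merge(moe, merge_logits=[0.3, -1.0, 0.5, 2.0])
```

Immediately after upcycling, every expert is identical, so any convex merge reproduces the original dense weights. The experts only diverge once the MoE model is trained, at which point the merge coefficients determine how much each expert contributes to the final dense RM. The merged model has the same shape as the original feed-forward block, which is why inference cost matches a dense RM.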

Taxonomy

Topics: Reinforcement Learning in Robotics · Emotion and Mood Recognition · Explainable Artificial Intelligence (XAI)