Probing RLVR training instability through the lens of objective-level hacking
Yiming Dong, Kun Fu, Haoyu Li, Xinyuan Zhu, Yurou Liu, Lijing Shao, Jieping Ye, Zheng Wang

TL;DR
This paper investigates the causes of training instability in RLVR for large language models, especially MoE architectures, by introducing a framework based on objective-level hacking and analyzing training-inference discrepancy growth.
Contribution
The paper introduces a framework for understanding RLVR instability through objective-level hacking and gives a mechanistic explanation for the growth of the training-inference discrepancy in MoE models.
Findings
Identifies token-level credit misalignment as the origin of objective-level hacking, and hence as a source of training instability.
Formalizes the mechanism behind abnormal training-inference discrepancy growth.
Provides insights for designing more stable RLVR algorithms.
Abstract
Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but training is often prone to instability, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and manifests as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key…
