On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua

TL;DR
This paper builds on the observation that the reasoning gains from RLVR concentrate in rank-1 components, and reports implicit reward overfitting, low-rank dynamics, and spectral properties that influence training and continual learning.
Contribution
It uncovers the low-rank spectral dynamics and implicit reward overfitting phenomena in RLVR, providing insights into model parameter shaping and training behavior.
Findings
RLVR's reasoning is concentrated in rank-1 components.
Models can perform well on the test set even when their training rewards remain relatively low.
Singular value distributions in RLVR are heavy-tailed.
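To make the "rank-1 component" idea concrete, here is a minimal sketch in NumPy. It assumes, purely for illustration, that the rank-1 component of a layer is the top singular direction of the difference between the tuned and base weight matrices; the listing does not state the paper's exact definition, and this is not the paper's Periodic Rank-1 Substitution procedure. The function names and the synthetic matrices are hypothetical.

```python
import numpy as np

def rank1_component(delta):
    """Best rank-1 approximation (in Frobenius norm) of a 2-D matrix, via SVD."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0])

def rank1_energy_fraction(delta):
    """Share of squared Frobenius norm carried by the top singular value."""
    s = np.linalg.svd(delta, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

# Synthetic stand-ins for one linear layer (assumption: not real model weights).
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 32))
a = rng.standard_normal(64)
b = rng.standard_normal(32)
# Hypothetical update: one dominant direction plus small noise.
delta = np.outer(a, b) + 0.01 * rng.standard_normal((64, 32))
w_tuned = w_base + delta

r1 = rank1_component(w_tuned - w_base)
# Keep only the rank-1 part of the update on top of the base weights.
w_substituted = w_base + r1
frac = rank1_energy_fraction(w_tuned - w_base)
print(f"rank-1 share of update energy: {frac:.4f}")
```

In this toy setup the update is nearly rank-1 by construction, so the energy fraction is close to 1; on real checkpoints the value would have to be measured per layer.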
Abstract
Extensive recent research has demonstrated that the enhanced reasoning capabilities that models acquire through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Building on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during training. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 components in RLVR do not retain model knowledge other than mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The distribution of singular values of almost all linear layers in…
