On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou

TL;DR
This paper argues that the direction of RLVR updates, not just their magnitude, is key to understanding gains in LLM reasoning, and proposes methods that exploit it, including one that improves reasoning accuracy at test time without additional training.
Contribution
It introduces the signed token-level log probability difference as a key metric and develops practical methods for test-time and training-time improvements based on update direction.
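As a rough illustration of the metric named above, the sketch below computes a signed, per-token log probability difference from two sets of logits. The function names (`log_softmax`, `signed_logprob_diff`) and the toy logits are assumptions made for this example; they are not taken from the paper or from any library.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized list of logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def signed_logprob_diff(base_logits, rlvr_logits, token_ids):
    # For each position t, return log p_rlvr(token_t) - log p_base(token_t).
    # Positive: the RLVR model raised the token's probability; negative: lowered it.
    diffs = []
    for base_row, rlvr_row, tok in zip(base_logits, rlvr_logits, token_ids):
        diffs.append(log_softmax(rlvr_row)[tok] - log_softmax(base_row)[tok])
    return diffs

# Toy example (hypothetical 2-token vocabulary, 2 positions).
base = [[0.0, 0.0], [0.0, 0.0]]
rlvr = [[1.0, 0.0], [0.0, 0.0]]
diffs = signed_logprob_diff(base, rlvr, token_ids=[0, 1])
```

In this toy case the first entry is positive because the RLVR logits favor token 0 more than the base logits do, and the second entry is zero because both models agree. A magnitude-only metric would discard that sign.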
Findings
Direction of RLVR updates identifies reasoning-critical changes better than magnitude-based metrics
Amplifying policy along update direction improves reasoning accuracy
Focusing training on high-impact tokens enhances model performance
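One way to picture "amplifying the policy along the update direction" is to push the RLVR model's next-token log probabilities further along their difference from the base model and then renormalize. The sketch below is only that: the exact formula, the `alpha` parameter, and the function names are assumptions for illustration, since this summary does not state how the paper's method is defined.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one list of scores.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def extrapolate_distribution(base_logits, rlvr_logits, alpha):
    # Assumed form: score = log p_rlvr + alpha * (log p_rlvr - log p_base),
    # renormalized into a probability distribution over the vocabulary.
    # alpha = 0 recovers the RLVR distribution unchanged.
    base_lp = log_softmax(base_logits)
    rlvr_lp = log_softmax(rlvr_logits)
    scores = [r + alpha * (r - b) for b, r in zip(base_lp, rlvr_lp)]
    return [math.exp(x) for x in log_softmax(scores)]

# Toy example (hypothetical 2-token vocabulary): RLVR favors token 0.
base_logits = [0.0, 0.0]
rlvr_logits = [1.0, 0.0]
p_rlvr = extrapolate_distribution(base_logits, rlvr_logits, alpha=0.0)
p_amplified = extrapolate_distribution(base_logits, rlvr_logits, alpha=1.0)
```

With a positive `alpha`, tokens whose probability RLVR increased gain further mass and tokens it decreased lose mass, so `p_amplified[0]` exceeds `p_rlvr[0]` in this toy case. No model weights are changed, which is consistent with a test-time intervention.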
Abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that this signed difference identifies sparse yet reasoning-critical updates more effectively than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
