On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Kexin Huang; Haoming Meng; Junkang Wu; Jinda Lu; Chiyu Ma; Ziqian Chen; Xue Wang; Bolin Ding; Jiancan Wu; Xiang Wang; Xiangnan He; Guoyin Wang; Jingren Zhou

arXiv:2603.22117·cs.LG·March 24, 2026

TL;DR

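This paper argues that the direction of RLVR updates, not only their magnitude, matters for LLM reasoning. It builds two methods on this insight: a test-time method that improves reasoning accuracy without additional training, and a training-time method that focuses learning on high-impact tokens.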

Contribution

The paper introduces the signed, token-level log-probability difference between the base and RLVR models as a measure of update direction, and develops practical test-time and training-time methods based on it.

Findings

01. Direction of RLVR updates better identifies reasoning-critical changes

02. Amplifying policy along update direction improves reasoning accuracy

03. Focusing training on high-impact tokens enhances model performance
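
Finding 02 can be made concrete with a small sketch. It assumes that "amplifying along the update direction" means shifting the next-token log-probabilities further along the RLVR-minus-base difference and then renormalizing; the exact formulation is not given on this page, so this is an illustration rather than the authors' method. The names `extrapolate_log_probs` and `alpha` are hypothetical, and the logits are toy values.

```python
import math


def log_softmax(scores):
    """Normalize a list of scores into log-probabilities (numerically stable)."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def extrapolate_log_probs(base_logp, rlvr_logp, alpha=0.5):
    """Hypothetical extrapolation: step `alpha` further along (rlvr - base).

    alpha = 0 recovers the RLVR distribution; alpha > 0 amplifies the
    per-token change that RLVR made relative to the base model.
    """
    if len(base_logp) != len(rlvr_logp):
        raise ValueError("both distributions must cover the same vocabulary")
    scores = [r + alpha * (r - b) for b, r in zip(base_logp, rlvr_logp)]
    return log_softmax(scores)


# Toy 3-token vocabulary: RLVR raised token 1 and lowered token 0.
base = log_softmax([2.0, 1.0, 0.0])
rlvr = log_softmax([1.0, 2.0, 0.0])
extra = extrapolate_log_probs(base, rlvr, alpha=1.0)
```

In this toy case the token whose probability RLVR increased becomes even more likely after extrapolation, while the result remains a valid distribution. How large `alpha` should be is an empirical question that this sketch does not address.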

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $\Delta \log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta \log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that…
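
As a rough illustration of the metric described in the abstract, the sketch below computes a signed, per-token $\Delta \log p$ for one sequence and ranks tokens by it. It assumes the sign convention $\Delta \log p = \log p_{\text{RLVR}} - \log p_{\text{base}}$ (the page does not state the order of subtraction), and it assumes the per-token log-probabilities of the same tokens under both models are already available. The function names, tokens, and numbers are hypothetical.

```python
def delta_log_p(base_logps, rlvr_logps):
    """Signed per-token difference, assuming the convention rlvr minus base."""
    if len(base_logps) != len(rlvr_logps):
        raise ValueError("log-prob lists must align token by token")
    return [r - b for b, r in zip(base_logps, rlvr_logps)]


def top_k_by_signed_delta(tokens, deltas, k):
    """Return the k tokens whose probability RLVR increased the most."""
    order = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return [(tokens[i], deltas[i]) for i in order[:k]]


# Toy sequence with made-up log-probabilities under each model.
tokens = ["So", "the", "answer", "is", "42"]
base_logps = [-0.5, -1.0, -3.0, -0.25, -2.0]
rlvr_logps = [-0.5, -1.5, -1.0, -0.25, -0.5]

deltas = delta_log_p(base_logps, rlvr_logps)
top = top_k_by_signed_delta(tokens, deltas, k=2)
```

Because the sign is kept, a token that RLVR made less likely (here "the") ranks below unchanged tokens, whereas a magnitude-only score such as the absolute difference would rank it above them.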

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)