TL;DR
This paper shows that RLVR weight trajectories are low-rank and predictable, which enables RELEX, a simple linear extrapolation method that approximates future checkpoints from a short window of observed training.
Contribution
The authors introduce RELEX, a rank-1 extrapolation technique that reduces the number of RLVR training steps needed for large language models by estimating a rank-1 subspace from early checkpoints and extrapolating along it with linear regression.
Findings
RELEX matches or exceeds RLVR performance using only 15% of the training steps.
RELEX can extrapolate checkpoints 10 to 20 times beyond the observed training window.
The authors attribute the method's success to a denoising effect of the rank-1 projection.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and…
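The abstract's recipe (estimate a rank-1 subspace from a short observation window, then extrapolate its magnitude with linear regression) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `rank1_linear_extrapolate` is hypothetical, and the sketch assumes the rank-1 direction is taken as the top right singular vector of the stacked, flattened parameter deltas. The paper may instead compute the approximation per weight matrix or in some other way.

```python
import numpy as np


def rank1_linear_extrapolate(base, checkpoints, steps, target_step):
    """Hypothetical sketch of rank-1 linear extrapolation (not the paper's code).

    base:        flattened initial parameters, shape (D,)
    checkpoints: observed flattened parameters, one array of shape (D,) per step
    steps:       training step of each observed checkpoint
    target_step: future step to extrapolate to
    """
    base = np.asarray(base, dtype=np.float64).ravel()
    # Parameter deltas relative to the base model, stacked as a (T, D) matrix.
    deltas = np.stack(
        [np.asarray(c, dtype=np.float64).ravel() - base for c in checkpoints]
    )
    # Assumption: the rank-1 subspace is the top right singular vector.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]
    # Magnitude of each observed delta along that direction.
    coeffs = deltas @ direction
    # Ordinary least-squares line: coeff ~ slope * step + intercept.
    slope, intercept = np.polyfit(np.asarray(steps, dtype=np.float64), coeffs, deg=1)
    predicted = slope * target_step + intercept
    # The sign ambiguity of the SVD cancels, since coeffs flip with direction.
    return base + predicted * direction


# Toy usage on synthetic data: a noisy trajectory that drifts along one direction.
rng = np.random.default_rng(0)
dim = 64
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
w0 = rng.normal(size=dim)
observed_steps = [10, 20, 30, 40]
observed = [
    w0 + 0.01 * t * true_dir + 1e-3 * rng.normal(size=dim) for t in observed_steps
]
w_future = rank1_linear_extrapolate(w0, observed, observed_steps, target_step=400)
```

In the toy run, the per-step noise is mostly orthogonal to the dominant direction and is discarded by the projection, which is one way to picture the denoising effect mentioned in the findings. Real checkpoints would be flattened per tensor or per model before being passed in; that choice is left open here.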
