TL;DR
This paper introduces NExt, a nonlinear low-rank trajectory modeling framework that accelerates reinforcement learning with verifiable rewards (RLVR) for large language models, reducing computational overhead by approximately 37.5%.
Contribution
The paper proposes a novel nonlinear extrapolation method for low-rank parameter trajectories, improving RLVR efficiency for large language models.
Findings
Reduces RLVR computational overhead by approximately 37.5%.
Effectively models nonlinear parameter trajectories during RLVR.
Demonstrates robustness across various tasks and algorithms.
Abstract
Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities. However, it requires guiding the model to perform extensive exploration and learning, which leads to substantial computational overhead and has become a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and that its dominance over the original parameters is further amplified during LoRA training. Based on these insights, we propose the Nonlinear…
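To make two terms from the abstract concrete, the following is a minimal sketch, assuming NumPy, of (a) linear extrapolation of parameters between two checkpoints, as attributed to prior work, and (b) how much of a weight update is captured by its rank-1 component. The function names `linear_extrapolate` and `rank1_component`, the step factor `alpha`, and the toy matrices are all hypothetical and chosen for illustration; this is not the NExt method, which the truncated abstract does not describe.

```python
import numpy as np


def linear_extrapolate(w_prev, w_curr, alpha):
    """Continue the most recent update direction by a factor alpha.

    alpha = 0 returns w_curr unchanged; larger alpha steps further along
    the straight line through the two checkpoints (an assumed, generic form).
    """
    return w_curr + alpha * (w_curr - w_prev)


def rank1_component(delta):
    """Best rank-1 approximation of a nonzero 2-D update and its energy share.

    The energy share is s_0^2 / sum_i s_i^2, i.e. the fraction of the squared
    Frobenius norm of delta explained by the top singular direction.
    """
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    rank1 = s[0] * np.outer(u[:, 0], vt[0])
    energy_share = s[0] ** 2 / np.sum(s ** 2)
    return rank1, energy_share


# Toy data (illustrative only): an update dominated by one direction plus noise.
rng = np.random.default_rng(0)
w_prev = rng.normal(size=(8, 6))
delta = np.outer(rng.normal(size=8), rng.normal(size=6)) + 0.01 * rng.normal(size=(8, 6))
w_curr = w_prev + delta

rank1, share = rank1_component(w_curr - w_prev)
w_next = linear_extrapolate(w_prev, w_curr, alpha=1.0)

print("rank-1 energy share of the update:", share)
print("extrapolated step norm:", np.linalg.norm(w_next - w_curr))
```

A linear rule like this assumes the update direction stays fixed between checkpoints. The abstract's observation that the rank-1 subspace does not evolve linearly is the stated motivation for replacing it with a nonlinear model.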
