Near-Future Policy Optimization
Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang

TL;DR
This paper introduces NPO, a mixed-policy reinforcement learning scheme that learns from a model's own near-future checkpoints to improve training efficiency and final performance, validated on Qwen3-VL-8B-Instruct.
Contribution
The paper proposes NPO, which uses a model's own future self as the source of off-policy trajectories, and AutoNPO, an adaptive variant that optimizes intervention timing.
Findings
NPO improves average performance from 57.88 to 62.84 on Qwen3-VL-8B-Instruct.
AutoNPO further increases average performance to 63.15.
Both methods accelerate convergence and raise the performance ceiling.
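The size of each gain follows directly from the reported averages. The short Python snippet below only restates that arithmetic; the variable names are illustrative and the numbers are the three figures quoted above.

```python
# Reported average scores on Qwen3-VL-8B-Instruct (taken from the findings above).
baseline = 57.88
npo = 62.84
auto_npo = 63.15

# Absolute gains in points, rounded to two decimals to avoid float noise.
npo_gain = round(npo - baseline, 2)            # NPO over the baseline
auto_npo_gain = round(auto_npo - baseline, 2)  # AutoNPO over the baseline
auto_over_npo = round(auto_npo - npo, 2)       # AutoNPO over NPO

print(f"NPO gain: {npo_gain:.2f}")
print(f"AutoNPO gain: {auto_npo_gain:.2f}")
print(f"AutoNPO over NPO: {auto_over_npo:.2f}")
```

Most of the improvement comes from NPO itself; the adaptive variant adds a smaller increment on top.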
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality). Neither simultaneously satisfies the two conditions required to maximize the effective learning signal: strong enough (more new knowledge to learn) and close enough (more readily absorbed). We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later…
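To make the general idea of mixing off-policy trajectories into on-policy exploration concrete, here is a minimal Python sketch. It is not the paper's algorithm: the excerpt above does not say how NPO selects, weights, or schedules trajectories, so the function name `build_mixed_group`, the replacement rule, and the `n_off` parameter are all hypothetical and chosen only for illustration.

```python
import random

def build_mixed_group(on_policy, off_policy, n_off, seed=0):
    """Swap n_off on-policy rollouts for off-policy ones, keeping group size fixed.

    Hypothetical helper: the mixing rule here is an assumption, not NPO's.
    """
    if n_off > len(on_policy) or n_off > len(off_policy):
        raise ValueError("n_off exceeds the number of available trajectories")
    rng = random.Random(seed)
    # Keep a random subset of the current policy's own rollouts.
    kept = rng.sample(on_policy, len(on_policy) - n_off)
    # Inject trajectories from another source (e.g. a later checkpoint).
    injected = rng.sample(off_policy, n_off)
    # Tag each trajectory with its origin so a trainer could treat them differently.
    return [(t, "on") for t in kept] + [(t, "off") for t in injected]

# Toy usage with placeholder strings standing in for trajectories.
group = build_mixed_group(["a", "b", "c", "d"], ["x", "y"], n_off=1)
print(group)
```

Tagging each trajectory by origin is one simple way to let the loss handle the two sources separately, though whether NPO does so is not stated in the text above.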
