Near-Future Policy Optimization

Chuanyu Qin; Chenxu Yang; Qingyi Si; Naibin Gu; Dingyu Yao; Zheng Lin; Peng Fu; Nan Duan; Jiaqi Wang

arXiv:2604.20733 · cs.LG · April 23, 2026

TL;DR

This paper introduces NPO, a mixed-policy reinforcement learning scheme that draws off-policy trajectories from a model's own near-future checkpoints to speed up training and improve final performance, validated on Qwen3-VL-8B-Instruct.

Contribution

The paper proposes NPO, a method that uses a policy's own near-future self as the source of off-policy trajectories, and introduces AutoNPO, an adaptive variant that optimizes when the intervention is applied.

Findings

01 NPO improves average performance from 57.88 to 62.84 on Qwen3-VL-8B-Instruct.

02 AutoNPO further increases average performance to 63.15.

03 Both methods accelerate convergence and raise the performance ceiling.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong enough (higher $Q$, more new knowledge to learn) and close enough (lower $V$, more readily absorbed) conditions required to maximize the effective learning signal $S = Q / V$. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later…
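
The trade-off behind $S = Q / V$ can be made concrete with a small Python sketch. It is only an illustration: the abstract excerpt does not say how $Q$ and $V$ are measured, so the `TrajectorySource` class, the helper names, and every number below are hypothetical placeholders rather than values or code from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TrajectorySource:
    """Hypothetical container for one candidate source of off-policy trajectories."""
    name: str
    q: float  # assumed "strength" score Q: higher means more new knowledge to learn
    v: float  # assumed "distance" score V: lower means more readily absorbed


def learning_signal(q: float, v: float) -> float:
    """Effective learning signal S = Q / V, following the ratio stated in the abstract."""
    if v <= 0:
        raise ValueError("V must be positive for the ratio to be meaningful")
    return q / v


def best_source(sources: Iterable[TrajectorySource]) -> TrajectorySource:
    """Pick the source with the largest S = Q / V."""
    return max(sources, key=lambda s: learning_signal(s.q, s.v))


# Made-up numbers chosen only to mirror the qualitative picture in the abstract:
# an external teacher is strong but far, replay is close but weak,
# and a near-future checkpoint sits in between on both axes.
sources = [
    TrajectorySource("external_teacher", q=0.9, v=3.0),
    TrajectorySource("past_replay", q=0.2, v=0.5),
    TrajectorySource("near_future_checkpoint", q=0.6, v=0.8),
]

for s in sources:
    print(f"{s.name}: S = {learning_signal(s.q, s.v):.2f}")
print("largest S:", best_source(sources).name)
```

With these placeholder values, the source that is neither the strongest nor the closest still has the largest ratio, which is the intuition the abstract gives for preferring a source that is both strong enough and close enough.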
