Gradient Extrapolation-Based Policy Optimization
Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque, Ser-Nam Lim

TL;DR
GXPO is a policy-update rule for GRPO-style reasoning RL on language models. It approximates multi-step lookahead using three backward passes on the same batch of rollouts, improving accuracy and training speed over GRPO.
Contribution
GXPO introduces a practical, efficient lookahead approximation technique that enhances reasoning RL for language models without additional rollout costs.
Findings
GXPO improves pass@1 accuracy by up to 5 points over GRPO.
GXPO achieves up to 4x step speedup and 2.33x wall-clock speedup.
GXPO keeps the backward-pass cost fixed (three backward passes during an active phase) while improving performance.
Abstract
Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that…
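The steps named in the abstract (two fast optimizer steps, a measurement of how the gradients change, a predicted virtual K-step point, and a partial move toward it) can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract is truncated here and does not give the extrapolation formula, so the geometric gradient-decay model, the plain SGD inner steps, the toy quadratic loss, and the names `grad`, `extrapolated_update`, `plain_steps`, `lr`, `K`, and `alpha` are all hypothetical.

```python
import numpy as np

def grad(theta, a=0.5):
    # Toy stand-in for the gradient of the GRPO loss on one fixed batch:
    # f(theta) = 0.5 * a * ||theta||^2 (an assumption for illustration only).
    return a * theta

def extrapolated_update(theta0, lr=0.1, K=8, alpha=0.5, grad_fn=grad):
    """Approximate K plain gradient steps using three gradient evaluations."""
    g0 = grad_fn(theta0)           # backward pass 1
    theta1 = theta0 - lr * g0      # fast step 1
    g1 = grad_fn(theta1)           # backward pass 2
    theta2 = theta1 - lr * g1      # fast step 2
    g2 = grad_fn(theta2)           # backward pass 3

    # ASSUMED model of gradient change: each further step scales the
    # gradient by a constant ratio r, estimated from the last two gradients.
    r = float(np.dot(g2, g1) / (np.dot(g1, g1) + 1e-12))
    r = min(max(r, 0.0), 1.0)      # clip so the extrapolation cannot grow

    # Sum the K-2 predicted gradients g2, r*g2, ..., r^(K-3)*g2.
    weight = sum(r ** j for j in range(K - 2))
    theta_virtual = theta2 - lr * weight * g2

    # Move the parameters only partway toward the virtual lookahead point.
    theta_new = theta0 + alpha * (theta_virtual - theta0)
    return theta_new, theta_virtual

def plain_steps(theta0, lr=0.1, K=8, grad_fn=grad):
    # Reference: K real gradient steps, costing K gradient evaluations.
    theta = theta0.copy()
    for _ in range(K):
        theta = theta - lr * grad_fn(theta)
    return theta

theta0 = np.array([1.0, -2.0, 3.0])
theta_new, theta_virtual = extrapolated_update(theta0)
theta_true = plain_steps(theta0)
```

On this isotropic quadratic the gradients really do shrink by a constant factor per step, so the virtual point coincides with the result of eight real steps while using only three gradient evaluations. On a real policy loss the decay model would only be approximate, which is one plausible reason to move partway (`alpha < 1`) rather than jump to the predicted point.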
