Gradient Extrapolation-Based Policy Optimization

Ismam Nur Swapnil; Aranya Saha; Tanvir Ahmed Khan; Mohammad Ariful Haque; Ser-Nam Lim

arXiv:2605.06755 · cs.LG · May 11, 2026

TL;DR

GXPO is a policy-update rule for GRPO-style reasoning RL on language models. It approximates multi-step lookahead with three backward passes on the same rollout batch, with reported gains of up to 5 pass@1 points over GRPO and up to 4x step speedup.

Contribution

GXPO is a plug-compatible update rule for GRPO-style training that approximates a longer lookahead while reusing the same rollouts, rewards, advantages, and GRPO loss, so it adds no rollout or reward-computation cost at the lookahead points.

Findings

01. GXPO improves pass@1 accuracy by up to 5 points over GRPO.

02. GXPO achieves up to 4x step speedup and 2.33x wall-clock speedup.

03. GXPO keeps backward-pass cost fixed at three backward passes during its active phase while improving performance.

Abstract

Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that…
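
The update described above (two fast steps, a measurement of how the gradient changes, a predicted K-step point, and a partial move toward it) can be illustrated on a toy problem. The abstract as shown here does not give the extrapolation formula, so the sketch below is not the paper's method. It assumes plain gradient descent as the fast optimizer, a generic `grad_fn` standing in for the GRPO loss gradient on a fixed batch, and a scalar geometric model of gradient decay. The function name `extrapolated_lookahead_step` and the parameters `K` and `alpha` are hypothetical.

```python
import numpy as np

def extrapolated_lookahead_step(theta0, grad_fn, lr=0.1, K=8, alpha=0.5):
    """Toy lookahead using 3 gradient evaluations (illustrative, not the paper's rule).

    Assumes K >= 2 and that grad_fn is evaluated on the same fixed batch each time.
    """
    g0 = grad_fn(theta0)           # backward pass 1
    theta1 = theta0 - lr * g0      # fast step 1
    g1 = grad_fn(theta1)           # backward pass 2
    theta2 = theta1 - lr * g1      # fast step 2
    g2 = grad_fn(theta2)           # backward pass 3

    # ASSUMPTION: gradients shrink geometrically, g_{k+1} ~= rho * g_k.
    # rho is estimated by projecting g2 onto g1.
    rho = float(g2 @ g1) / (float(g1 @ g1) + 1e-12)

    # Predicted steps 3..K use gradients g2, rho*g2, ..., rho^(K-3)*g2.
    scale = sum(rho ** j for j in range(K - 2))
    theta_virtual = theta2 - lr * scale * g2

    # Move only partway from the starting point toward the virtual point.
    theta_new = theta0 + alpha * (theta_virtual - theta0)
    return theta_new, theta_virtual


if __name__ == "__main__":
    # Quadratic loss 0.5 * theta^T H theta with a diagonal Hessian (toy stand-in).
    H = np.diag([2.0, 0.5])
    grad = lambda t: H @ t
    start = np.array([1.0, -2.0])

    new, virtual = extrapolated_lookahead_step(start, grad, lr=0.1, K=8, alpha=0.5)

    # Reference: actually run K gradient steps (K backward passes).
    exact = start.copy()
    for _ in range(8):
        exact = exact - 0.1 * grad(exact)

    print("virtual K-step point:", virtual)
    print("exact K-step point:  ", exact)
    print("updated parameters:  ", new)
```

Under this assumed model the virtual point matches true K-step gradient descent exactly when the Hessian is a multiple of the identity, and only approximately otherwise. The sketch shows the cost structure only: three gradient evaluations regardless of `K`.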
