Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards
Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh

TL;DR
This paper introduces NFPO, a new reinforcement learning algorithm that improves reasoning in language models by balancing bias and variance through multi-step likelihood ratio correction.
Contribution
The paper proposes N-Step Forward-Trace Policy Optimization (NFPO), which bridges PPO surrogate objectives and exact policy gradients to improve reasoning performance.
Findings
NFPO provides a tighter policy-improvement bound than standard PPO.
Experiments show NFPO consistently improves reasoning benchmark performance.
Theoretical analysis confirms NFPO's bias-variance trade-off control.
Abstract
Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the N-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next N tokens. Building on this idea, we propose N-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the N-step forward trace into the masked policy gradient framework. NFPO provides a continuous…
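As a rough illustration of what a cumulative likelihood ratio over the next tokens could look like, the sketch below computes, for each token position t, the product of per-token ratios pi_new / pi_old over positions t through t + n, truncated at the end of the sequence. The function name `forward_trace_ratios`, the choice to include position t itself in the window, and the truncation rule are assumptions made for illustration; the abstract does not give the exact definition, so the paper's formulation may differ. Under these assumptions, n = 0 recovers the ordinary per-token PPO ratio, and a large n multiplies ratios over the whole remaining sequence.

```python
import math
from typing import List, Sequence


def forward_trace_ratios(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    n: int,
) -> List[float]:
    """Cumulative likelihood ratio over a forward window of tokens.

    Hypothetical helper, not taken from the paper. For each position t it
    returns exp(sum_{k=t}^{min(t+n, T-1)} (logp_new[k] - logp_old[k])),
    i.e. the product of per-token ratios from t up to n tokens ahead.
    """
    if len(logp_new) != len(logp_old):
        raise ValueError("log-probability sequences must have equal length")
    if n < 0:
        raise ValueError("n must be non-negative")

    T = len(logp_new)

    # Prefix sums of per-token log ratios, so each window is one subtraction.
    prefix = [0.0]
    for new, old in zip(logp_new, logp_old):
        prefix.append(prefix[-1] + (new - old))

    ratios = []
    for t in range(T):
        end = min(t + n + 1, T)  # exclusive end of the window, truncated at T
        ratios.append(math.exp(prefix[end] - prefix[t]))
    return ratios
```

Working in log space and exponentiating once per window avoids multiplying many small probabilities directly. In a training loop these ratios would typically still be clipped or masked, since longer windows increase the variance that the abstract attributes to importance sampling.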
