Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

Deokgyu Yoon; Hyungkyu Kang; Joongkyu Lee; Byeongchan Kim; Gyungin Shin; Sungrae Park; Min-hwan Oh

arXiv:2605.20865·cs.LG·May 21, 2026

TL;DR

This paper introduces NFPO, a new reinforcement learning algorithm that improves reasoning in language models by balancing bias and variance through multi-step likelihood ratio correction.

Contribution

The paper proposes N-Step Forward-Trace Policy Optimization (NFPO), which bridges PPO surrogate objectives and exact policy gradients to improve reasoning performance.

Findings

01. NFPO provides a tighter policy-improvement bound than standard PPO.

02. Experiments show that NFPO consistently improves performance on reasoning benchmarks.

03. Theoretical analysis shows that NFPO controls the bias-variance trade-off.

Abstract

Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous…
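
As a rough illustration of the phrase "cumulative likelihood ratio of the next $N-1$ tokens", the sketch below computes one plausible per-token weight: the token's own ratio $\pi_\theta / \pi_{\text{old}}$ multiplied by the ratios of up to $N-1$ following tokens. This is an assumption about the form of the forward trace, not the paper's definition; the function name `forward_trace_weights` and its arguments are hypothetical, and clipping, masking, and advantage terms are left out entirely.

```python
import math


def forward_trace_weights(logp_new, logp_old, n):
    """Hypothetical N-step forward-trace weights (illustrative only).

    logp_new, logp_old: per-token log-probabilities of the sampled tokens
    under the current and the behavior policy, respectively.
    n: window length; n = 1 leaves only the usual per-token ratio.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(logp_new) != len(logp_old):
        raise ValueError("log-probability sequences must have equal length")

    # Per-token log likelihood ratios: log pi_new(a_t) - log pi_old(a_t).
    log_ratio = [a - b for a, b in zip(logp_new, logp_old)]
    num_tokens = len(log_ratio)

    # Prefix sums so each window product is a difference of two sums.
    prefix = [0.0]
    for value in log_ratio:
        prefix.append(prefix[-1] + value)

    weights = []
    for t in range(num_tokens):
        # Window covers token t plus up to n - 1 following tokens,
        # truncated at the end of the sequence.
        end = min(t + n, num_tokens)
        weights.append(math.exp(prefix[end] - prefix[t]))
    return weights


# Toy sequence of three tokens (made-up probabilities).
old = [math.log(0.5)] * 3
new = [math.log(0.6), math.log(0.4), math.log(0.75)]
for n in (1, 2, 3):
    print(n, [round(w, 4) for w in forward_trace_weights(new, old, n)])
```

Under this reading, a larger window multiplies more ratios together, which is where the extra importance-sampling variance mentioned in the abstract would come from, while `n = 1` keeps only the local per-token ratio.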
