FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Chiyu Ma; Shuo Yang; Kexin Huang; Jinda Lu; Haoming Meng; Shangshang Wang; Bolin Ding; Soroush Vosoughi; Guoyin Wang; Jingren Zhou

arXiv:2603.19835·cs.LG·April 1, 2026

TL;DR

FIPO introduces a reinforcement learning algorithm that enhances large language models' reasoning by using future-KL divergence for dense advantage calculation, leading to longer reasoning chains and improved accuracy.

Contribution

It proposes a novel dense advantage formulation using future-KL divergence, improving reasoning length and accuracy in large language models.

Findings

01 Extended chain-of-thought length from 4,000 to over 10,000 tokens.

02 Increased AIME 2024 Pass@1 accuracy from 50.0% to 58.0%.

03 Outperformed baseline models in reasoning tasks.

Abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly…
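The contrast the abstract draws between a uniform, outcome-based advantage and a dense, per-token one can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the exact FIPO formula, so the function names, the use of the per-token log-probability shift as a stand-in for a KL signal, the exponential weighting, the clipping range, and the discount value are all assumptions made for illustration. Only the group-standardized outcome advantage follows the usual GRPO convention.

```python
import math


def group_advantages(rewards):
    """GRPO-style outcome advantage: standardize rewards within one sampled group.

    Each response gets a single scalar, which standard training then applies
    uniformly to every token of that response.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


def discounted_future_sum(values, gamma):
    """For each position t, return sum over k >= t of gamma**(k - t) * values[k]."""
    out = [0.0] * len(values)
    running = 0.0
    for t in reversed(range(len(values))):
        running = values[t] + gamma * running
        out[t] = running
    return out


def dense_advantages(sequence_advantage, logp_new, logp_old,
                     gamma=0.9, clip=(0.5, 2.0)):
    """Hypothetical dense re-weighting (an assumption, not the paper's formula).

    The per-token log-probability shift between the current and old policy is
    accumulated over future positions with discount gamma, turned into a
    clipped multiplicative weight, and applied to the scalar advantage.
    """
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    future = discounted_future_sum(log_ratios, gamma)
    low, high = clip
    weights = [min(max(math.exp(f), low), high) for f in future]
    return [sequence_advantage * w for w in weights]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
    advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
    # Toy per-token log-probabilities for the first response.
    logp_old = [-1.2, -0.8, -2.0, -0.5]
    logp_new = [-1.0, -0.8, -1.5, -0.6]
    print(dense_advantages(advantages[0], logp_new, logp_old))
```

When the current and old policies assign identical log-probabilities, every weight equals 1 and the sketch reduces to the uniform per-token advantage that the abstract describes as the standard baseline behavior.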

Code & Models

Repositories

qwenpilot/FIPO
github

Models

🤗
QwenPilot/FIPO_32B
model· 49 dl· ♡ 3
