TL;DR
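FIPO is a reinforcement learning algorithm that improves the reasoning of large language models by folding discounted future-KL divergence into the advantage calculation. This gives each token its own (dense) advantage rather than a single trajectory-wide value, and the reported effect is longer reasoning chains and higher accuracy.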
Contribution
A dense advantage formulation that re-weights each token by discounted future-KL divergence, replacing the uniform per-trajectory advantage that GRPO-style training derives from outcome-based rewards.
Findings
Extended average chain-of-thought length on Qwen2.5-32B from roughly 4,000 to over 10,000 tokens.
Increased AIME 2024 Pass@1 accuracy from 50.0% to 58.0%.
Outperformed baseline models in reasoning tasks.
Abstract
We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly…
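The contrast the abstract draws between a uniform trajectory-level advantage and a dense, future-aware one can be illustrated with a small sketch. This is not the paper's implementation, since the summary above does not give the exact FIPO formula. It assumes a group-normalized outcome reward for the GRPO-style baseline, uses the per-token log-ratio between the new and old policy as a sample-based stand-in for KL, and applies a clipped exponential weight. The function names, the discount `gamma`, and the clip range are all hypothetical choices made for illustration.

```python
import numpy as np


def grpo_uniform_advantages(rewards, lengths):
    """GRPO-style baseline: one group-normalized advantage per trajectory,
    copied unchanged onto every token of that trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return [np.full(n, a) for a, n in zip(adv, lengths)]


def discounted_future_sum(x, gamma):
    """out[t] = sum over k >= t of gamma**(k - t) * x[k], computed right to left."""
    x = np.asarray(x, dtype=float)
    out = np.zeros(len(x))
    running = 0.0
    for t in range(len(x) - 1, -1, -1):
        running = x[t] + gamma * running
        out[t] = running
    return out


def future_kl_weighted_advantages(uniform_adv, logp_new, logp_old,
                                  gamma=0.9, clip=(0.5, 2.0)):
    """Illustrative dense advantage (assumed form, not taken from the paper):
    scale each token's advantage by a clipped exponential of the discounted
    sum of log-ratios over that token and the tokens after it."""
    # Per-token log-ratio, used here as a sample-based proxy for KL.
    delta = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    future = discounted_future_sum(delta, gamma)
    weight = np.clip(np.exp(future), clip[0], clip[1])
    return np.asarray(uniform_adv, dtype=float) * weight


if __name__ == "__main__":
    # Two sampled answers to one prompt: the first correct, the second not.
    uniform = grpo_uniform_advantages(rewards=[1.0, 0.0], lengths=[4, 3])
    logp_old = np.array([-1.0, -1.2, -0.8, -0.5])
    logp_new = np.array([-1.0, -1.1, -0.6, -0.5])
    dense = future_kl_weighted_advantages(uniform[0], logp_new, logp_old)
    print("uniform:", uniform[0])
    print("dense:  ", dense)
```

When the new and old log-probabilities are identical, every weight equals 1 and the sketch reduces to the uniform baseline, which makes the re-weighting easy to ablate. The discount controls how far ahead a token's weight looks: a small `gamma` makes it depend mostly on nearby tokens, while a value close to 1 lets shifts late in the trajectory affect early tokens.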
