TL;DR
Latent-GRPO is a reinforcement learning method for latent reasoning that addresses the instability of RL in latent space, yielding more accurate reasoning with shorter reasoning chains across benchmarks.
Contribution
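The paper proposes Latent-GRPO, a reinforcement learning algorithm for latent reasoning. It targets the three coupled bottlenecks named in the abstract (absence of intrinsic latent manifolds, exploration-optimization misalignment, and latent mixture non-closure) and improves stability and performance over existing methods.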
Findings
Latent-GRPO improves Pass@1 by 7.86 points on low-difficulty tasks.
It surpasses explicit GRPO by 4.27 points on high-difficulty tasks.
It achieves stronger pass@$k$ performance with shorter reasoning chains.
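For readers unfamiliar with the metric, the following is a minimal sketch of the widely used unbiased pass@$k$ estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them are correct. Whether the paper computes pass@$k$ exactly this way is an assumption; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard combinatorial estimator,
    not necessarily the evaluation code used in the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k = 1` the estimator reduces to the fraction of correct samples, `c / n`, which is the Pass@1 figure quoted above.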
Abstract
Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and reinforcement learning in latent space remains highly unstable. We study this problem through the lens of Group Relative Policy Optimization (GRPO), and show that directly adapting GRPO to latent reasoning is fundamentally non-trivial: latent reasoning changes both the probability density and the sampling mechanism, causing three coupled bottlenecks: absence of intrinsic latent manifolds, where unconstrained exploration pushes rollouts off the valid latent manifold; exploration-optimization misalignment, where trajectory-level rewards can induce incorrect token-level updates; and latent mixture non-closure, where…
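As background for the misalignment between trajectory-level rewards and token-level updates mentioned in the abstract, here is a minimal sketch of the group-relative advantage used in standard (explicit) GRPO. It is not the Latent-GRPO algorithm. The function names, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: population standard deviation plus a small eps;
    implementations differ on both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Copy each trajectory-level advantage to every token of that rollout."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Four rollouts for one prompt, with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
per_token = broadcast_to_tokens(advantages, lengths=[3, 5, 2, 4])
```

Every token in a rollout receives the same advantage regardless of its individual contribution, which is one way a trajectory-level reward can credit or penalize a step that did not deserve it.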
