TL;DR
This paper introduces POISE, a method that uses a language model's internal states to estimate rewards, reducing variance and computational costs in reinforcement learning for large reasoning models.
Contribution
POISE leverages internal signals from the policy model to estimate rewards online, enabling more stable, efficient training without additional large-scale critics or multiple rollouts.
Findings
POISE matches DAPO performance with less compute on reasoning benchmarks.
The value estimator performs comparably to a separate large language model-based value model.
POISE generalizes well across various verifiable tasks.
Abstract
Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model-scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent…
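To make the contrast between the two kinds of baseline concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `LinearProbe`, `probe_advantage`, and the two-dimensional feature vectors are hypothetical stand-ins for the hidden-state and token-entropy features the abstract mentions, the group-mean function omits the standard-deviation normalization that GRPO usually applies, and the cross-rollout construction is not reproduced.

```python
import math


def group_mean_advantages(rewards):
    """Group-mean baseline: subtract the mean reward of rollouts from one prompt."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


class LinearProbe:
    """Hypothetical logistic probe mapping a feature vector to a predicted reward in (0, 1)."""

    def __init__(self, dim, lr=0.5):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, reward):
        # One SGD step on binary cross-entropy; (p - reward) is the gradient w.r.t. the logit.
        err = self.predict(x) - reward
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def probe_advantage(probe, x, reward):
    """Learned baseline: advantage is the reward minus the probe's predicted reward."""
    return reward - probe.predict(x)


# Toy online loop with made-up features (assumption): the first feature pattern
# always earns reward 1, the second always earns reward 0.
probe = LinearProbe(dim=2)
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
for _ in range(200):
    for x, r in data:
        adv = probe_advantage(probe, x, r)  # baseline is read before the probe is updated
        probe.update(x, r)
```

The group-mean baseline needs several rollouts of the same prompt before it can produce any advantage, whereas the probe yields a baseline from a single feature vector. That is the cost difference the abstract points to.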
