Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Yunho Choi; Jongwon Lim; Woojin Ahn; Minjae Oh; Jeonghoon Shim; Yohan Jo

arXiv:2605.07579·cs.LG·May 12, 2026

TL;DR

This paper introduces POISE, a method that uses a language model's own internal states to estimate the expected verifiable reward as a baseline, reducing gradient variance and compute cost in reinforcement learning for large reasoning models.

Contribution

POISE uses internal signals from the policy model to estimate the expected reward online, enabling more stable and efficient training without a separate policy-scale critic or the multiple rollouts per prompt that a group-mean baseline depends on.

Findings

01

POISE matches DAPO performance with less compute on reasoning benchmarks.

02

The value estimator performs comparably to a separate large language model-based value model.

03

POISE generalizes well across various verifiable tasks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model-scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent…
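
The probe-as-baseline idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `LinearProbe` class, the logistic-regression form, the feature vectors, and the function `cross_rollout_advantages` are all hypothetical names chosen for illustration. In particular, the abstract is cut off before it says what the "independent" source is; the sketch assumes, purely for illustration, that it is a second rollout of the same prompt, so that a rollout's baseline never depends on its own sampled tokens.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LinearProbe:
    """Hypothetical lightweight probe: feature vector -> predicted P(reward = 1).

    In the setting the abstract describes, the features would come from the
    policy's hidden states and token-entropy statistics; here they are just
    plain lists of floats.
    """

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, feat):
        return sigmoid(sum(w * x for w, x in zip(self.w, feat)) + self.b)

    def update(self, feat, reward):
        # One online SGD step on binary cross-entropy against the verified reward.
        err = self.predict(feat) - reward
        self.w = [w - self.lr * err * x for w, x in zip(self.w, feat)]
        self.b -= self.lr * err


def cross_rollout_advantages(probe, feats, rewards):
    """Advantages for two rollouts of the same prompt.

    ASSUMPTION (not confirmed by the truncated abstract): each rollout's
    baseline is predicted from the OTHER rollout's features, so the baseline
    is independent of the rollout whose gradient it scales.
    """
    (feat_a, feat_b), (r_a, r_b) = feats, rewards
    return r_a - probe.predict(feat_b), r_b - probe.predict(feat_a)
```

A usage pattern under these assumptions would be to call `cross_rollout_advantages` for each pair of rollouts, use the results to weight the policy-gradient update, and then call `probe.update` with each rollout's own features and verified reward so that the probe keeps tracking the current policy.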

Code & Models

Repositories

holi-lab/POISE
github
