PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

Wonjoong Kim; Yeonjun In; Sangwu Park; Dongha Lee; Chanyoung Park

arXiv:2605.17877 · cs.AI · May 19, 2026

TL;DR

The paper introduces PAIR, a prefix-aware internal reward model that provides dense, step-level rewards for multi-turn agent optimization, improving credit assignment without external calls or ground-truth data.

Contribution

It proposes a novel two-stage model combining hidden-state probing and attention mechanisms to generate robust internal rewards in multi-step tasks.
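
As a rough illustration of the hidden-state probing half of this idea, the sketch below scores each agent step with a linear probe. It is not the paper's model: it omits the attention component and the two-stage design, and the function `probe_score`, the 4-dimensional hidden states, and the probe weights are all hypothetical values chosen for the example.

```python
import math

def probe_score(hidden_state, weights, bias):
    """Linear probe: sigmoid(w . h + b), read as an estimated probability of correctness."""
    logit = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical hidden states, one vector per agent step.
# Real LLM hidden states have thousands of dimensions; 4 keeps the example readable.
step_hidden_states = [
    [0.2, -0.1, 0.5, 0.0],
    [1.0, 0.3, -0.2, 0.4],
    [-0.8, 0.1, 0.0, -0.5],
]

# Hypothetical probe parameters; in practice they would be fit on labeled steps.
weights = [1.0, 0.5, -0.5, 1.0]
bias = 0.0

# One score per step, usable as a dense step-level reward signal.
step_rewards = [probe_score(h, weights, bias) for h in step_hidden_states]
print(step_rewards)
```

Because the probe reads activations the policy model already computes, scoring a step needs no extra model call, which is the property the summary above describes as internal.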

Findings

01. PAIR achieves the highest AUROC on contaminated trajectories.

02. PAIR operates at negligible inference cost.

03. PAIR enables dense step-level reward signals without external calls.

Abstract

A significant hurdle for current LLMs is the execution of complex, multi-stage tasks. Group Relative Policy Optimization (GRPO) has emerged as a leading choice, but its reliance on sparse outcome rewards severely limits credit assignment across intermediate steps. Existing remedies, such as running full rollouts to assign step-level advantages, calling external LLM judges at each step, or computing intrinsic rewards that require ground-truth answers at every evaluation, introduce significant costs or practical constraints. We hypothesize that internal correctness probing over LLM hidden states can be repurposed as a step-level reward signal, potentially addressing all of these limitations at once. However, existing probing research assumes clean inputs, and we first show that this assumption breaks down in multi-step settings: hidden-state probes degrade severely under prefix…
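
To make the credit-assignment limitation concrete, here is a minimal sketch of group-relative advantage computation under outcome-only rewards. It is an illustration rather than the paper's code: the function names, the example rewards and step counts, and the choice of population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(outcome_rewards, eps=1e-8):
    """Normalize each rollout's scalar outcome reward against its group."""
    mu = mean(outcome_rewards)
    # Assumption: population standard deviation; implementations may differ.
    sigma = pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]

def broadcast_to_steps(advantages, steps_per_rollout):
    """With outcome-only rewards, every step in a rollout inherits the same advantage."""
    return [[a] * n for a, n in zip(advantages, steps_per_rollout)]

# Hypothetical group of four rollouts for one task: 1.0 = success, 0.0 = failure.
outcome_rewards = [1.0, 0.0, 0.0, 1.0]
steps_per_rollout = [3, 5, 2, 4]

advantages = group_relative_advantages(outcome_rewards)
per_step = broadcast_to_steps(advantages, steps_per_rollout)

for rollout_index, step_advantages in enumerate(per_step):
    print(rollout_index, step_advantages)
```

Every step in a failed rollout receives the same negative advantage, including steps that were individually correct. A step-level reward signal is meant to replace that single broadcast value with one value per step.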
