TL;DR
This paper introduces A$^2$TGPO, a novel reinforcement learning method for large language models that adaptively normalizes, accumulates, and clips turn-level signals to improve multi-turn interaction training.
Contribution
It uses Information Gain (IG), an intrinsic per-turn signal that needs no external evaluator, and applies adaptive normalization, accumulation, and clipping to it, targeting three systematic challenges that prior IG-based RL training of agentic LLMs faces.
Findings
Normalized IG within turn groups improves stability.
Variance-rescaled accumulation maintains consistent advantage magnitudes.
Adaptive clipping enhances policy updates based on turn informativeness.
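The three findings above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the suffix-sum direction of accumulation, the division by the square root of the number of summed terms, and the clip-range formula with its `base`, `scale`, and `max_eps` parameters are all assumptions made for illustration. The input `ig` is assumed to be an array of shape (rollouts, turns) holding one IG value per turn.

```python
import numpy as np


def normalize_within_turn(ig, eps=1e-8):
    """Z-score IG across rollouts, separately for each turn index (column)."""
    mean = ig.mean(axis=0, keepdims=True)
    std = ig.std(axis=0, keepdims=True)
    return (ig - mean) / (std + eps)


def accumulate_rescaled(z):
    """Sum normalized IG from each turn to the last, then divide by sqrt(#terms).

    Assumption: if the per-turn terms were independent with unit variance,
    a sum of k terms would have variance k, so dividing by sqrt(k) keeps
    the magnitude comparable across turn positions.
    """
    n_turns = z.shape[1]
    # Reverse cumulative sum: entry t holds z[t] + z[t+1] + ... + z[T-1].
    suffix = np.cumsum(z[:, ::-1], axis=1)[:, ::-1]
    counts = np.arange(n_turns, 0, -1)  # number of terms in each suffix sum
    return suffix / np.sqrt(counts)


def adaptive_clip_range(z, base=0.2, scale=0.1, max_eps=0.4):
    """Hypothetical rule: widen a PPO-style clip range for turns with larger |z|."""
    return np.minimum(base + scale * np.abs(z), max_eps)


# Toy batch: 4 rollouts, 3 turns each.
ig = np.array([
    [0.1, 0.3, 0.0],
    [0.2, -0.1, 0.4],
    [0.0, 0.2, 0.2],
    [0.3, 0.0, -0.2],
])
z = normalize_within_turn(ig)
adv = accumulate_rescaled(z)
clip_eps = adaptive_clip_range(z)
```

In this sketch each column of `z` has zero mean and unit standard deviation, so a turn is only compared with the same turn position in other rollouts. The paper's exact grouping, accumulation, and clipping rules may differ.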
Abstract
Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the…
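As a rough illustration of the IG signal described in the abstract, the sketch below computes the per-turn change in the probability the policy assigns to the ground truth. The function name `information_gain` and the use of a plain probability difference (rather than, say, a log-probability difference) are assumptions, and the input is assumed to hold one probability before the first turn plus one after each turn.

```python
import numpy as np


def information_gain(p_truth):
    """Per-turn change in the predicted probability of the ground truth.

    p_truth: sequence of length T + 1 (value before turn 1, then after each turn).
    Returns an array of length T where entry t is p[t + 1] - p[t].
    """
    p = np.asarray(p_truth, dtype=float)
    return np.diff(p)


# Hypothetical probabilities for a 3-turn trajectory.
ig_example = information_gain([0.10, 0.25, 0.20, 0.60])
```

Under this definition the per-turn values telescope, so their sum equals the total change in the ground-truth probability over the trajectory, and a turn that lowers that probability receives a negative value.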
