Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization
Yu Li, Sizhe Tang, Tian Lan

TL;DR
This paper introduces T-STAR, a framework for multi-turn large language model agents that merges sampled trajectories into a Cognitive Tree, uses the tree to identify critical reasoning steps, and assigns step-level credit during policy optimization.
Contribution
It proposes a method that recovers the latent reward structure shared across sampled trajectories and synthesizes corrective reasoning, leading to improved performance on complex reasoning tasks.
Findings
T-STAR outperforms strong baselines on various reasoning benchmarks.
The Cognitive Tree effectively consolidates trajectories for better credit assignment.
Critical reasoning steps are identified and leveraged for more effective policy updates.
Abstract
Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi-step reasoning tasks. Existing approaches such as Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to every step in a chain and ignoring critical steps that may disproportionately affect the reasoning outcome. In this paper, we propose T-STAR (Tree-structured Self-Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories. Specifically, we consolidate trajectories into a unified Cognitive Tree by identifying and merging functionally similar steps/nodes. This enables an Introspective Valuation mechanism that back-propagates trajectory-level rewards through the tree to obtain a new notion of variance-reduced relative advantage at the step level. Using…
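The contrast the abstract draws between chain-level and tree-level credit can be illustrated with a toy sketch. The code below is not the paper's algorithm: the merging criterion and the valuation rule are not given in this excerpt, so the sketch assumes that functionally similar steps have already been mapped to identical string keys (merging then reduces to sharing a prefix), that a node's value is the mean terminal reward of the trajectories passing through it, and that a step's advantage is the value of the node it reaches minus the value of its parent. All function names (`build_tree`, `node_values`, `step_advantages`, `uniform_advantages`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_tree(trajectories, rewards):
    """Merge trajectories that share a prefix of step keys into one tree.

    ASSUMPTION: equal step keys stand in for "functionally similar" steps.
    Returns a dict mapping each prefix (a tuple of step keys, () for the
    root) to the terminal rewards of every trajectory passing through it.
    """
    node_rewards = defaultdict(list)
    for steps, reward in zip(trajectories, rewards):
        for depth in range(len(steps) + 1):
            node_rewards[tuple(steps[:depth])].append(reward)
    return node_rewards


def node_values(node_rewards):
    """ASSUMPTION: a node's value is the mean reward of its trajectories."""
    return {prefix: mean(rs) for prefix, rs in node_rewards.items()}


def step_advantages(steps, values):
    """Per-step advantage: value of the node reached minus its parent's value."""
    return [
        values[tuple(steps[: d + 1])] - values[tuple(steps[:d])]
        for d in range(len(steps))
    ]


def uniform_advantages(trajectories, rewards):
    """Chain-level baseline: every step inherits (reward - group mean)."""
    baseline = mean(rewards)
    return [[r - baseline] * len(steps) for steps, r in zip(trajectories, rewards)]


if __name__ == "__main__":
    # Four toy trajectories; only the first one succeeds.
    trajs = [
        ["plan", "lookup", "answer_a"],
        ["plan", "lookup", "answer_b"],
        ["plan", "guess", "answer_c"],
        ["plan", "guess", "answer_d"],
    ]
    rewards = [1.0, 0.0, 0.0, 0.0]

    values = node_values(build_tree(trajs, rewards))
    print(step_advantages(trajs[0], values))       # [0.0, 0.25, 0.5]
    print(uniform_advantages(trajs, rewards)[0])   # [0.75, 0.75, 0.75]
```

In this toy case the shared `plan` step receives zero advantage because every trajectory takes it, while `lookup` and the final answer carry the credit. Under these assumptions the per-step advantages telescope, so they sum to the trajectory reward minus the root value (0.75 here), which is the same total that the uniform scheme spreads evenly across all three steps.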
