TL;DR
This paper introduces TSPO (Turn-level Stage-aware Policy Optimization), a reinforcement learning method that improves multi-turn search reasoning in LLMs by assigning rewards at the turn level instead of relying only on the final outcome. It reports a 24% average performance gain on Qwen2.5-3B models.
Contribution
TSPO employs the First-Occurrence Latent Reward mechanism to preserve process signals and boost reward variance without external annotations, addressing the double homogenization dilemma.
Findings
TSPO achieves a 24% average performance gain on Qwen2.5-3B models.
TSPO outperforms state-of-the-art baselines in multi-turn search reasoning.
The method makes intra-group advantage estimation more efficient.
Abstract
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the thinking, reasoning, and tooling involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards often make intra-group advantage estimation inefficient during sampling in methods like Group Relative Policy Optimization (GRPO). To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby…
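To make the intra-group homogenization problem concrete, here is a minimal Python sketch, not the paper's implementation. It assumes a GRPO-style advantage computed as the reward minus the group mean, divided by the group standard deviation, and a simplified first-occurrence rule that gives a fixed partial reward when the ground-truth answer string shows up in any turn of a rollout whose final answer is wrong. The function names, the substring match, and the partial value of 0.5 are all illustrative assumptions. The excerpt above does not specify the actual FOLR formula.

```python
from statistics import mean, pstdev


def first_occurrence_turn(turn_texts, answer):
    """Return the index of the first turn containing the answer, else None.

    Assumption: a case-insensitive substring match stands in for whatever
    matching rule the paper actually uses.
    """
    needle = answer.strip().lower()
    for i, text in enumerate(turn_texts):
        if needle in text.lower():
            return i
    return None


def rollout_reward(turn_texts, final_answer, answer, partial=0.5):
    """Hypothetical reward: 1.0 for a correct final answer, `partial` if the
    answer surfaced in some turn but the final answer is wrong, else 0.0."""
    if final_answer.strip().lower() == answer.strip().lower():
        return 1.0
    if first_occurrence_turn(turn_texts, answer) is not None:
        return partial
    return 0.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages (population std is an assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of three rollouts for one question; every final answer is wrong,
# but the second rollout retrieved the correct answer in its second turn.
answer = "Paris"
group = [
    (["search: capital of Spain", "result: Madrid"], "Madrid"),
    (["search: capital of France", "result: Paris is the capital"], "Lyon"),
    (["search: largest city in Italy", "result: Rome"], "Rome"),
]

outcome_only = [1.0 if final.lower() == answer.lower() else 0.0 for _, final in group]
with_partial = [rollout_reward(turns, final, answer) for turns, final in group]

print(group_advantages(outcome_only))  # every advantage is 0.0: no learning signal
print(group_advantages(with_partial))  # the second rollout gets a positive advantage
```

With outcome-only rewards, every rollout in the failed group receives the same reward, so the normalized advantages collapse to zero. The partial reward restores variance inside the group, which is the effect the abstract attributes to FOLR.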
