TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

Shichao Ma; Zhiyuan Ma; Ming Yang; Xiaofan Li; Xing Wu; Jintao Du; Yu Cheng; Weiqiang Wang; Qiliang Liu; Zhengyang Zhou; Yang Wang

arXiv:2601.22776·cs.AI·April 7, 2026

TL;DR

This paper introduces TSPO, a reinforcement learning method for multi-turn search reasoning in LLMs. Rather than relying only on sparse outcome-level rewards, it assigns partial rewards to the step where the ground-truth answer first appears, and it outperforms existing approaches.

Contribution

TSPO employs the First-Occurrence Latent Reward mechanism to preserve process signals and boost reward variance without external annotations, addressing the double homogenization dilemma.
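
The first-occurrence idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the case-insensitive substring match, and the `partial` and `outcome` reward values are all hypothetical choices made here for clarity.

```python
from typing import List, Optional


def first_occurrence_turn(turn_texts: List[str], answer: str) -> Optional[int]:
    """Return the index of the first turn whose text contains the answer.

    Assumption: a case-insensitive substring match stands in for whatever
    matching rule the paper actually uses.
    """
    needle = answer.strip().lower()
    if not needle:
        return None
    for i, text in enumerate(turn_texts):
        if needle in text.lower():
            return i
    return None


def turn_rewards(
    turn_texts: List[str],
    answer: str,
    final_correct: bool,
    partial: float = 0.5,   # hypothetical partial reward value
    outcome: float = 1.0,   # hypothetical outcome reward value
) -> List[float]:
    """Assign a partial reward to the first turn that surfaces the answer,
    plus an outcome reward on the last turn if the final answer is correct."""
    rewards = [0.0] * len(turn_texts)
    hit = first_occurrence_turn(turn_texts, answer)
    if hit is not None:
        rewards[hit] += partial
    if final_correct and rewards:
        rewards[-1] += outcome
    return rewards


trajectory = [
    "search: capital of France",
    "retrieved: Paris is the capital of France",
    "answer: Lyon",
]
print(turn_rewards(trajectory, "Paris", final_correct=False))
```

In this sketch, a trajectory that retrieves the right evidence but answers incorrectly still receives a nonzero signal at the retrieval turn, which is the kind of process signal an outcome-only reward discards.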

Findings

01. TSPO achieves a 24% average performance gain on Qwen2.5-3B models.

02. TSPO outperforms state-of-the-art baselines in multi-turn search reasoning.

03. The method improves the efficiency of intra-group advantage estimation.

Abstract

Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the thinking, reasoning, and tooling involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards often lead to inefficiencies in intra-group advantage estimation with methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby…
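
To make intra-group homogenization concrete, the sketch below computes group-relative advantages in the usual GRPO style: each sampled trajectory's reward is normalized by the mean and standard deviation of its group. It is an illustration only. The function name, the `eps` constant, the use of the population standard deviation, and the example reward values are assumptions, not details taken from the paper.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the group's mean and standard deviation.

    Assumption: population standard deviation with a small eps for numerical
    stability; implementations differ on these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Outcome-only rewards: every sampled trajectory failed, so all rewards tie.
outcome_only = [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(outcome_only))

# Hypothetical rewards where one failed trajectory still earned partial credit
# for surfacing the ground-truth answer at some turn.
with_partial = [0.0, 0.5, 0.0, 0.0]
print(group_relative_advantages(with_partial))
```

When every reward in the group is identical, each advantage is zero and the group contributes no learning signal. A partial reward on even one trajectory breaks the tie, so the group yields nonzero advantages again.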

Code & Models

Repositories

Flipped-May/TSPO (GitHub)
