StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization
Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu

TL;DR
StepSearch introduces a step-wise reinforcement learning framework with detailed intermediate rewards to enhance multi-hop reasoning in LLMs, significantly improving search-based QA performance with limited training data.
Contribution
The paper proposes a novel step-wise proximal policy optimization method with fine-grained supervision for training search LLMs, outperforming previous global-reward approaches.
Findings
Achieved absolute improvements of 11.2% and 4.2% on multi-hop QA benchmarks for the 3B and 7B models, respectively.
Demonstrated effectiveness of step-wise supervision with only 19k training samples.
Constructed a fine-grained dataset with sub-question search trajectories for training and evaluation.
Abstract
Efficient multi-hop reasoning requires Large Language Model (LLM) based agents to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex, multi-hop QA because of the sparse rewards that come from a global signal only. To address this gap in existing research, we introduce StepSearch, a framework for search LLMs that is trained with a step-wise proximal policy optimization method. It consists of richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories based on open source datasets through a set…
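The abstract describes per-step rewards built from information gain and redundancy penalties, but the truncated text does not give the formulas. The following is a minimal illustrative sketch of that idea, not the authors' implementation. The function names `jaccard` and `step_rewards`, the token-overlap similarity, and the equal weighting of gain and penalty are all assumptions made for this example.

```python
from typing import List, Set


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings (an assumed stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def step_rewards(steps: List[List[str]], gold_docs: List[str]) -> List[float]:
    """Return one reward per search step: information gain minus redundancy.

    steps     -- documents retrieved at each search step (hypothetical input format)
    gold_docs -- reference documents for the sub-questions (hypothetical input format)
    """
    best = [0.0] * len(gold_docs)  # best similarity reached so far for each gold doc
    seen: Set[str] = set()         # every document retrieved in earlier steps
    rewards: List[float] = []
    for retrieved in steps:
        # Information gain: how much this step improves coverage of each gold doc.
        gain = 0.0
        for i, gold in enumerate(gold_docs):
            sim = max((jaccard(doc, gold) for doc in retrieved), default=0.0)
            if sim > best[i]:
                gain += sim - best[i]
                best[i] = sim
        gain /= max(len(gold_docs), 1)
        # Redundancy penalty: share of this step's documents already seen earlier.
        repeats = sum(1 for doc in retrieved if doc in seen)
        penalty = repeats / len(retrieved) if retrieved else 0.0
        seen.update(retrieved)
        rewards.append(gain - penalty)
    return rewards


if __name__ == "__main__":
    gold = ["paris is the capital of france"]
    trajectory = [
        ["paris is the capital of france"],  # new, relevant document
        ["paris is the capital of france"],  # the same document retrieved again
    ]
    print(step_rewards(trajectory, gold))
```

In this toy trajectory the first step is rewarded for surfacing a new relevant document, and the second is penalized for retrieving it again. A single global reward at the end of the episode could not distinguish the two steps, which is the sparsity problem the abstract points to.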
Taxonomy
Topics: Open Education and E-Learning · Digital Rights Management and Security · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
