TL;DR
SLATE combines truncated step-level sampling with dense, decomposed process rewards to improve retrieval-augmented reasoning in large language models, reducing the variance of advantage estimates and providing richer supervision than a single outcome reward.
Contribution
The paper proposes truncated step-level sampling with variance reduction and dense, decomposed process rewards, advancing step-level reinforcement learning for retrieval-augmented reasoning.
Findings
SLATE outperforms baselines on seven QA benchmarks.
Achieves a 7.0% improvement over Search-R1 on the 7B model.
Gains are largest on multi-hop tasks.
Abstract
Reinforcement learning has emerged as an effective paradigm for training large language models to interleave reasoning with search engine calls. However, existing approaches face a fundamental credit assignment problem: methods like Search-R1 assign a single outcome reward to the entire multi-step trajectory, providing no signal about which reasoning or retrieval decisions were responsible for success or failure. Process-reward methods such as StepSearch introduce step-level supervision but still sample complete trajectories independently, so advantage estimates at any given step are contaminated by the randomness of all other steps. We propose SLATE (Step-Level Advantage estimation for Truncated Exploration), which addresses both problems through two complementary ideas. First, truncated step-level sampling generates k continuations from a shared prefix, isolating all variation to a…
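The variance argument in the abstract can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the authors' implementation: each step's reward is an independent uniform draw, trajectory reward is the sum of step rewards, and the advantage is the reward minus the group mean. The names `step_reward`, `full_trajectory_group`, and `truncated_step_group` are hypothetical and chosen only for illustration.

```python
import random
import statistics


def group_advantages(rewards):
    """Mean-centered advantages within one sampled group."""
    baseline = statistics.fmean(rewards)
    return [r - baseline for r in rewards]


def step_reward(rng):
    """Hypothetical per-step reward: an independent draw in [0, 1)."""
    return rng.random()


def full_trajectory_group(rng, k, num_steps):
    """k independent complete trajectories: every step is resampled."""
    return [sum(step_reward(rng) for _ in range(num_steps)) for _ in range(k)]


def truncated_step_group(rng, k, num_steps):
    """k continuations from one shared prefix: only the final step varies."""
    prefix_reward = sum(step_reward(rng) for _ in range(num_steps - 1))
    return [prefix_reward + step_reward(rng) for _ in range(k)]


def mean_group_variance(sampler, rng, k, num_steps, trials):
    """Average within-group reward variance over many sampled groups."""
    variances = [
        statistics.pvariance(sampler(rng, k, num_steps)) for _ in range(trials)
    ]
    return statistics.fmean(variances)


rng = random.Random(0)
var_full = mean_group_variance(full_trajectory_group, rng, k=8, num_steps=5, trials=2000)
var_trunc = mean_group_variance(truncated_step_group, rng, k=8, num_steps=5, trials=2000)
print(f"full-trajectory group variance: {var_full:.3f}")
print(f"shared-prefix group variance:   {var_trunc:.3f}")
```

In this toy setting the shared prefix contributes the same constant to every sample in a group, so the spread of rewards (and therefore of the mean-centered advantages) reflects only the one step being compared. With independent full trajectories, the noise from every other step is mixed in as well.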
