$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

Harsh Goel; Akhil Udathu; Susmija Jabireddy; Pradnesh Kalkar; Atharva Parulekar

arXiv:2605.01248 · cs.LG · May 8, 2026

TL;DR

This paper introduces S^3-R1, a framework that combines synthetic training data with denser reward signals so that RL-trained models search more deeply with tools and answer multi-hop questions more accurately.

Contribution

The paper presents a synthetic data generation pipeline and a reward structure that together improve RL models' search and reasoning capabilities for question answering.

Findings

01. S^3-R1 outperforms existing baselines in out-of-domain generalization.

02. Synthetic data containing intermediate-difficulty questions improves training.

03. A reward design that focuses on search quality improves answer accuracy.

Abstract

Reinforcement learning (RL) post-training has enabled newer capabilities in models, such as agentic tool-use for search. However, these models struggle primarily due to limitations with sparse outcome-based rewards and a lack of training data that encapsulates questions of differing hardness, which results in models not performing deeper searches with tools to collect evidence for question-answering. To address these limitations, we introduce S^3-R1 (Synthetic data and stabilized Search R1), a framework that couples a data-centric approach with denser learning signals. We first develop a synthetic generation and curation pipeline that programmatically derives diverse, multi-hop questions from existing documents. This pipeline incorporates a retrieval-based verification step to specifically isolate questions of intermediate difficulty. We then pair this expanded training set with a…
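The abstract mentions a retrieval-based verification step that isolates questions of intermediate difficulty, but the text available here does not say how that step is implemented. The sketch below shows one plausible reading, under stated assumptions: each candidate question is attempted several times by a retrieval-backed solver, and only questions whose empirical pass rate falls inside a band are kept. The names `Candidate`, `pass_rate`, `keep_intermediate`, the exact-match comparison, and the `low`/`high` thresholds are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """A synthetic question paired with its reference answer (hypothetical type)."""
    question: str
    answer: str


def _normalize(text: str) -> str:
    # Simple normalization for exact-match comparison (an assumption;
    # the paper may use a different correctness check).
    return text.strip().lower()


def pass_rate(candidate: Candidate,
              solver: Callable[[str], str],
              attempts: int) -> float:
    """Fraction of attempts in which the solver returns the reference answer."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    hits = sum(
        1
        for _ in range(attempts)
        if _normalize(solver(candidate.question)) == _normalize(candidate.answer)
    )
    return hits / attempts


def keep_intermediate(candidates: List[Candidate],
                      solver: Callable[[str], str],
                      attempts: int = 4,
                      low: float = 0.25,
                      high: float = 0.75) -> List[Candidate]:
    """Keep questions that are neither always solved nor never solved.

    The band [low, high] is an illustrative choice, not a value from the paper.
    """
    return [
        c for c in candidates
        if low <= pass_rate(c, solver, attempts) <= high
    ]
```

In this sketch, `solver` stands in for whatever retrieval-augmented model performs the verification. Questions the solver always answers correctly carry little training signal, and questions it never answers may be unanswerable or flawed, which is the intuition behind keeping only the middle band.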
