SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval
Roxana Petcu, Evangelos Kanoulas, Maarten de Rijke

TL;DR
SubSearch introduces intermediate intrinsic rewards to guide unsupervised reasoning in large language models, improving multi-step reasoning and robustness in complex retrieval tasks without external supervision.
Contribution
The paper proposes a framework that optimizes the generator's reasoning process directly with intrinsic rewards, reducing reliance on annotated trajectories and supporting more autonomous reasoning.
Findings
Rewarding intermediate reasoning steps improves robustness on QA tasks.
SubSearch outperforms outcome-only reward models on seven benchmarks.
Intrinsic rewards enable autonomous, data-efficient reasoning without external supervision.
Abstract
Large language models (LLMs) are probabilistic in nature and perform more reliably when augmented with external information. As complex queries often require multi-step reasoning over the retrieved information, with no clear or predetermined reasoning path, they remain challenging. Recent approaches train models using reinforcement learning on the model's outcome, showing promise in improving how models handle complex information. We introduce SubSearch, a specialized framework that shifts from outcome-only supervision to intermediate reward signals that incentivize planning high-quality reasoning. Unlike previous work on process reward modeling, which focuses on training a separate reward model with annotated trajectories by either human annotators or large LLM judges, SubSearch directly optimizes the generator using intrinsic process rewards, which we define as internally-derived…
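To make the contrast between outcome-only supervision and intermediate reward signals concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract is truncated before it defines the intrinsic process rewards, so `step_reward` is a hypothetical placeholder the caller supplies, and the names `exact_match`, `combined_reward`, and `weight`, as well as the additive combination itself, are assumptions made for illustration.

```python
from typing import Callable, Sequence


def exact_match(prediction: str, gold: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def combined_reward(
    steps: Sequence[str],
    step_reward: Callable[[str], float],  # hypothetical per-step scorer
    outcome_reward: float,
    weight: float = 0.5,  # assumed trade-off between process and outcome
) -> float:
    """Add a weighted mean of per-step rewards to the outcome reward.

    With no intermediate steps, this reduces to the outcome-only reward.
    """
    if not steps:
        return outcome_reward
    process = sum(step_reward(s) for s in steps) / len(steps)
    return outcome_reward + weight * process
```

Under these assumptions, a trajectory that reaches the right answer through well-scored intermediate steps earns more than one that reaches it through poorly scored steps, which is the kind of signal an outcome-only reward cannot provide.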
