Think Before You Retrieve: Learning Test-Time Adaptive Search with Small Language Models
Supriti Vijay, Aman Priyanshu, Anu Vellore, Baturay Saglam, and Amin Karbasi

TL;DR
This paper introduces Orion, a training framework enabling small language models to perform iterative, strategic retrieval by learning search, reflection, and revision behaviors, significantly improving retrieval success across multiple benchmarks.
Contribution
Orion is a novel training approach that combines synthetic data, reinforcement learning, and inference algorithms to teach small models effective search and revision strategies for information retrieval.
Findings
Achieves 77.6% success on SciFact with a 1.2B model
Outperforms larger retrievers on multiple benchmarks
Demonstrates learned strategies can rival scale-based performance
Abstract
Effective information retrieval requires reasoning over partial evidence and refining strategies as information emerges. Yet current approaches fall short: neural retrievers lack reasoning capabilities, large language models (LLMs) provide semantic depth but at prohibitive cost, and query rewriting or decomposition limits improvement to static transformations. As a result, existing methods fail to capture the iterative dynamics of exploration, feedback, and revision that complex user queries demand. We introduce Orion, a training framework that enables compact models (350M-1.2B parameters) to perform iterative retrieval through learned search strategies. Orion combines: (1) synthetic trajectory generation and supervised fine-tuning to encourage diverse exploration patterns in models, (2) reinforcement learning (RL) that rewards effective query refinement and backtracking behaviors, and…
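The search, reflect, and revise cycle described in the abstract can be illustrated with a toy loop. This is a minimal sketch under loud assumptions, not Orion's implementation: the learned policy is replaced by a hand-written heuristic, the corpus and document IDs are invented, the retriever is plain term overlap, and the names `search`, `next_term`, and `iterative_retrieve` are hypothetical. The relevant document (`target`) is assumed known, as it would be when scoring a benchmark query.

```python
# Toy corpus; texts and IDs are invented for illustration only.
CORPUS = {
    "d1": "aspirin lowers fever",
    "d2": "fever accompanies influenza infection",
    "d3": "influenza vaccine design",
    "d4": "calcium supports bone density",
}


def search(query, k=3):
    """Stand-in retriever: rank documents by the number of shared terms."""
    q_terms = set(query.split())
    scored = []
    for doc_id, text in CORPUS.items():
        overlap = len(q_terms & set(text.split()))
        if overlap > 0:
            scored.append((-overlap, doc_id))
    scored.sort()  # highest overlap first, ties broken by document ID
    return [doc_id for _, doc_id in scored[:k]]


def next_term(base, seen_docs, tried):
    """Hypothetical revision rule: pick an untried term from evidence seen so far."""
    in_query = set(base.split())
    for doc_id in sorted(seen_docs):
        for term in CORPUS[doc_id].split():
            if term not in in_query and term not in tried:
                return term
    return None


def iterative_retrieve(query, target, max_steps=6, k=3):
    """Search, reflect on whether new evidence appeared, then revise or backtrack."""
    seen_docs, tried, trace = set(), set(), []
    base = query  # last query that surfaced new evidence
    for _ in range(max_steps):
        results = search(query, k)
        trace.append((query, results))
        if target in results:
            return trace, True
        new_docs = [d for d in results if d not in seen_docs]
        if new_docs:
            # Reflect: the query made progress, so keep it as the new base.
            base = query
            seen_docs.update(new_docs)
        # Otherwise the query was a dead end: backtrack by revising from `base`.
        term = next_term(base, seen_docs, tried)
        if term is None:
            break
        tried.add(term)
        query = f"{base} {term}"
    return trace, False
```

Calling `iterative_retrieve("aspirin", "d3")` on this toy corpus issues five queries. Two of them (`aspirin lowers` and `aspirin fever accompanies`) surface no new documents and are abandoned, which is the backtracking behavior in miniature. In the paper's setting a trained small language model, not a fixed rule, would decide how to revise.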
Peer Reviews
Decision · Submitted to ICLR 2026
The problem motivated is reasonable. The analysis of search behaviors (Table 5) is a very interesting idea.
To someone who has worked in this area for several years, the writing appears almost confused or contradictory across parts of the paper. Parts of the paper seem to suggest that the emphasis is on finetuning retrieval models with LLMs via RL, other parts seem to indicate that only the reasoning LLM that generates queries is being finetuned. To begin with, the problem of multi-hop retrieval is old; its modern instantiation is at least as old as HotPotQA (2018), a paper with over 3000 citations a
First, a framework is proposed that enables small models to have adaptive retrieval capabilities, significantly reducing reliance on large models. Second, by combining reinforcement learning and structured reasoning labeling, the model can proactively reflect and backtrack during the retrieval process, improving search accuracy. Third, experiments demonstrate that small models can outperform large models on multiple complex tasks through learning strategies, exhibiting high efficiency and prac
My main concern is that the novelty of this paper is quite limited, as its core idea is very similar to works [1,2], yet there is no relevant discussion or comparison in the main text. Although the appendix briefly mentions differences from Search-R1, claiming that Orion avoids the complexity of external knowledge bases, I believe this statement is inaccurate because Search-R1 itself was also implemented with offline corpora and lightweight retrieval components. Moreover, the paper should includ
The paper introduces a new concept of test-time adaptive retrieval reasoning for small language models, showing that even models without large-scale reasoning capacity can learn to think about what to search for before issuing queries. This reframes retrieval as an interactive reasoning process, not a static lookup. Experiments across multiple benchmarks show consistent and significant improvements over both static and multi-query retrievers. Small models trained with Orion approach the perform
1. The paper does not include a baseline that follows the standard two-stage SFT + RL training pipeline. A more complete comparison would involve first performing rejection sampling SFT to warm up the model using high-quality queries (which could be generated through the proposed beam search or other strategies, followed by an evaluation on that specific query), and then applying reinforcement learning for further optimization. The current baseline, such as DeepRetrieval, only uses RL without an
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications
