TL;DR
HybridDeepSearcher introduces a structured hybrid search method combining parallel query expansion and explicit evidence aggregation, significantly enhancing reasoning capabilities and test-time search scaling in large reasoning models.
Contribution
It proposes a novel hybrid search strategy and a supervised dataset, HDS-QA, to improve reasoning depth and evidence coverage in large reasoning models.
Findings
Outperforms state-of-the-art baselines on five benchmarks.
Achieves a +15.9 F1 improvement on FanOutQA.
Demonstrates consistent performance gains with increased search steps.
Abstract
Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval. However, we find that existing approaches rarely demonstrate test-time search scaling. Methods that extend reasoning through single-query sequential search suffer from limited evidence coverage, while approaches that generate multiple independent queries per step often lack structured aggregation, hindering deeper sequential reasoning. We propose a hybrid search strategy to address these limitations. We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning. To supervise this behavior, we introduce HDS-QA, a novel dataset that guides models to combine broad parallel search with…
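The loop the abstract describes (issue several queries in parallel, aggregate the evidence explicitly, then move on to the next sequential step) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `hybrid_search`, its callback parameters (`expand`, `search`, `aggregate`, `is_done`), and the toy corpus are all hypothetical names introduced here for illustration, and a real agent would back the callbacks with a reasoning model and a search tool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def hybrid_search(
    question: str,
    expand: Callable[[str, List[str]], List[str]],
    search: Callable[[str], str],
    aggregate: Callable[[str, List[str], List[str]], str],
    is_done: Callable[[str, List[str]], bool],
    max_steps: int = 4,
) -> List[str]:
    """Return one aggregated note per sequential step (hypothetical sketch)."""
    notes: List[str] = []
    for _ in range(max_steps):
        # Parallel query expansion: several queries for the current step.
        queries = expand(question, notes)
        if not queries:
            break
        # Run the queries concurrently; pool.map preserves input order.
        with ThreadPoolExecutor(max_workers=len(queries)) as pool:
            results = list(pool.map(search, queries))
        # Explicit aggregation before advancing to the next sequential step.
        notes.append(aggregate(question, queries, results))
        if is_done(question, notes):
            break
    return notes


# Toy stand-ins (assumptions, not part of the paper) so the sketch runs.
CORPUS = {"height of Tower A": "300 m", "height of Tower B": "250 m"}


def toy_expand(question: str, notes: List[str]) -> List[str]:
    # First step: ask about every entity at once; afterwards nothing is left.
    return [] if notes else list(CORPUS)


def toy_search(query: str) -> str:
    return CORPUS.get(query, "")


def toy_aggregate(question: str, queries: List[str], results: List[str]) -> str:
    return "; ".join(f"{q}: {r}" for q, r in zip(queries, results))


def toy_is_done(question: str, notes: List[str]) -> bool:
    return len(notes) >= 1


notes = hybrid_search(
    "Which tower is taller?", toy_expand, toy_search, toy_aggregate, toy_is_done
)
print(notes)
```

In this sketch the aggregated note from each step is passed back to `expand`, so a later step can depend on evidence gathered earlier. That dependency is what distinguishes the hybrid scheme from issuing independent queries with no aggregation.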
Peer Reviews
Decision: ICLR 2026 Poster
The strengths of this paper are:
- Authors compare to multiple recent methods that serve as strong baselines.
- HybridDeepSearcher beats baselines on multiple datasets. Some of these datasets like BrowseComp are very challenging QA datasets.
- The scaling curves convincingly show that HybridDeepSearcher is able to use extra tool calls and search turns effectively.
The weaknesses of this paper are:
- Parallel and multi-hop questions are generated using one fixed pipeline. This might lead to limited diversity, and many questions might have very similar patterns.
- Is HDS-QA only useful for improving search capabilities in 7-8B models, or do smaller/larger models also benefit from training on HDS-QA? For larger models you likely need traces from a larger teacher model?
- Experiments are limited to SFT on HDS-QA and there are no experiments to see if performance c
1. The paper is well-written with clear illustrations.
2. The model training recipe is clean and systematic.
3. The dataset is generated using Google’s People Also Ask feature, which seems to have significantly increased the difficulty and quality of the generated questions.
4. The experimental evaluation is comprehensive with a sufficient number of datasets and baselines. The analytical studies on test-time scaling are also insightful.
1. The pipeline still uses a bigger model, Qwen3-32B, to summarize retrieved documents, which may incur unwanted costs. Have the authors examined the performance using a smaller model, like the 8B model, as a summarizer?
2. As the primary contribution is a high-quality data synthesis pipeline, the paper did not explore the scaling behavior with respect to the amount of generated data. It would be interesting to explore questions like how much data is needed, and whether more data helps.
1. The paper draws attention to parallel search, which is an under-studied capability of deep research models.
2. Using parallel search might give the model an extended capability to gather information and make more efficient use of context length.
1. The paper emphasizes the novelty of performing parallel search in addition to sequential search; however, the main change can be understood as adding a parser on the tool side to parse a list of queries from a single query string, which seems trivial to me. This has been implemented in some open-source projects like MiroMind as a small feature.
2. The construction of the dataset claims to require parallel search, but the ablation failed to accurately demonstrate how training a sequential search mo
