DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, Chenyan Xiong

TL;DR
DeepResearchGym is an open-source, reproducible evaluation sandbox for deep research systems that uses a large-scale, transparent search API and an extended benchmark to assess system performance and report quality.
Contribution
It introduces a reproducible search API and evaluation protocol for benchmarking deep research systems, addressing transparency and cost issues of commercial APIs.
Findings
Systems with DeepResearchGym achieve performance comparable to commercial APIs.
Evaluation metrics remain consistent across different systems.
Models trained within the sandbox can generalize to commercial search.
Abstract
Deep research systems represent an emerging class of agentic information retrieval methods that generate comprehensive and well-supported reports to complex queries. However, most existing frameworks rely on dynamic commercial search APIs, which pose reproducibility and transparency challenges in addition to their cost. To address these limitations, we introduce DeepResearchGym as an open-source sandbox that combines a reproducible search API with a rigorous evaluation protocol for benchmarking deep research systems. The API indexes large-scale public web corpora, namely ClueWeb22 and FineWeb, using a state-of-the-art dense retriever and approximate nearest neighbor search via DiskANN. It achieves lower latency than popular commercial APIs while ensuring stable document rankings across runs, and is free for research use. To evaluate deep research systems' outputs, we extend the…
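To make the retrieval idea in the abstract concrete, here is a minimal sketch of dense retrieval with a stable ranking. It is not the DeepResearchGym API: the toy embeddings, the `INDEX` dictionary, and the `search` function are hypothetical, and an exact brute-force scan stands in for the DiskANN approximate nearest neighbor search that the paper describes. Ties are broken by document id, one simple way to keep rankings identical across runs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_vec, index, k=2):
    """Exact top-k search; ties are broken by doc id so the ranking is stable."""
    # Negate the score so an ascending sort yields highest similarity first.
    scored = [(-cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort()
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy embeddings standing in for dense-retriever outputs.
INDEX = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [1.0, 1.0, 0.0],
    "doc-d": [0.0, 0.0, 1.0],
}

first = search([1.0, 0.2, 0.0], INDEX)
second = search([1.0, 0.2, 0.0], INDEX)
print(first, first == second)
```

Because the corpus and the scoring are fixed, repeating the same query returns the same ranked list, which is the property a dynamic commercial search API cannot guarantee.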
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper contributes an open-source sandbox that can potentially facilitate reproducible research in deep research agents. 2. The authors deliver some insight into fine-grained performance of popular deep research agents.
1. line 154-155: While the authors use recent data sources to construct the sandbox, it is unclear from the paper how to keep the sandbox "up-to-date". 2. The contribution beyond the sandbox is limited, as the authors mostly follow existing work in their evaluation protocol. 3. While I understand the importance of having a reproducible environment for benchmarking deep research agents, I fail to see the discussion on the actual benefit from this paper. What are the evaluation nuances that DeepR
1. **Addresses critical reproducibility gap**: The field urgently needs standardized benchmarking infrastructure. The paper directly tackles cost, transparency, and reproducibility issues with commercial APIs. 2. **Comprehensive multi-dimensional evaluation**: The three-faceted framework (relevance, faithfulness, quality) captures different aspects of report generation quality, going beyond simple surface-form metrics. 3. **Nuanced analysis**: Query-level correlation analysis (Figure 2) and quer
1. **Heavy reliance on a single judge model without robustness analysis**: All automatic evaluations use GPT-4.1-mini exclusively as the judge, with no ablation studies using alternative models such as GPT-4o, Claude Sonnet, or open-source alternatives. This creates potential concerns about judge-specific biases, particularly since some evaluated systems (like OpenAI's deep research) are based on GPT models. 2. **Insufficient examination of static corpus limitations and ground-truth quality**: W
1. This paper proposes DeepResearchGym, an open-source benchmarking framework specifically designed to enable transparent and reproducible evaluation of deep research systems. Being free and open-source makes this work a valuable resource to the community. 2. Empirical evaluations show that the system achieves strong retrieval quality with minimal loss from approximate search, as well as maintaining response times below those attained by commercial APIs. 3. The paper is well-written and easy t
1. Although DeepResearchGym makes deep research systems easier to reproduce, with faster responses and lower cost, using static corpora may under-serve very time-sensitive queries. 2. Despite showing high correlation with human evaluation, the empirical results, obtained only on Researchy Questions with GPT-4.1 as a judge, might not comprehensively and robustly reflect the system's actual performance as a replacement for a search engine. Additionally, more fine-grained comparison between the
Taxonomy
Topics: Scientific Computing and Data Management · Explainable Artificial Intelligence (XAI)
