TL;DR
This paper introduces ODR+, an improved version of the open-source deep research agent ODR, which achieves a 10% success rate on the new BC-Small benchmark, advancing the evaluation of autonomous internet-based research systems.
Contribution
The paper adapts the BrowseComp benchmark into a smaller subset (BC-Small) for evaluating open research agents, proposes strategic enhancements to ODR, and demonstrates improved performance over existing proprietary and open-source systems.
Findings
ODR+ achieves 10% success rate on BC-Small benchmark.
All tested systems initially scored 0% on the benchmark.
Three strategic improvements significantly enhanced ODR's performance.
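One reviewer below notes that the 10% figure corresponds to 6 of 60 questions answered correctly and is reported without confidence intervals. As a rough illustration of how much uncertainty a sample of that size carries, the sketch below computes a Wilson score interval for a binomial proportion. The function name `wilson_interval` and the choice of the Wilson method are assumptions made for this example; they are not taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    This is an illustrative helper, not part of the ODR+ code.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # 6 correct out of 60, as quoted in the review excerpt.
    low, high = wilson_interval(6, 60)
    print(f"ODR+ accuracy: 10.0% (95% Wilson interval: {low:.1%} to {high:.1%})")

    # 0 correct out of 60, the score reported for the baseline systems.
    low0, high0 = wilson_interval(0, 60)
    print(f"Baseline accuracy: 0.0% (95% Wilson interval: {low0:.1%} to {high0:.1%})")
```

Under these assumptions the interval for 6/60 spans roughly 5% to 20%, which gives a sense of why the reviewer calls the result fragile. The Wilson interval is used here because it behaves sensibly for small samples and for proportions near 0, where the simpler normal approximation collapses to zero width.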
Abstract
We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user, and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we only found one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally-tractable DRA benchmark for academic labs. We benchmark ODR and two other proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper proposes three key strategies to improve the Open Deep Research (ODR) system. 2. The paper will release the code for ODR+ to support continued progress.
1. Details are not provided on how this smaller subset of BrowseComp-Small is selected. The original BrowseComp contains information such as distribution of topics of the test set, human performance on questions etc. Such details are missing for BrowseComp-Small which makes it hard to understand which skills are being tested by BrowseComp-Small and what it means to have 0% accuracy on this set. 2. The paper argues that BrowseComp-Small is a smaller and more practical subset of BrowseComp but co
1. Practical contribution: introduces a smaller, more accessible subset (BC-Small) for BrowseComp-style evaluation. 2. Ablation sanity checks: removing any major component degrades performance, demonstrating each stage matters. 3. Focused study on an important emerging topic (LLM-based deep research agents) that the community cares about.
1. Statistical rigor is weak — 6/60 correct (10%) with no confidence intervals, variance estimates, or bootstrap analysis makes the result fragile. 2. Baseline comparison fairness concerns — proprietary systems evaluated without structured-output wrappers and without similar tuning; zero-shot vs trained split mismatch. I also did not understand what "tuning" meant for their ODR+ system. 3. Ablations too limited — only on a 20-question subset; no partial ablations or sensitivity studies (e.g.,
1. The paper correctly identifies and addresses the core problem hindering open-source DRA development: the lack of an available baseline and benchmark. 2. The paper does an excellent job in reproducibility, providing pseudocode (Algorithm 1), prompt summaries (Table 1), evaluation methodology, and hyperparameters, which is critical for subsequent research. 3. The paper is well-organized and clearly written. The authors clearly articulate a complex problem and present a logically sound solu
1. The paper claims that ODR+ "achieves the current state-of-the-art (SOTA) performance on the BrowseComp benchmark among open-source models". This is a very misleading claim because the only open-source model it was experimentally compared against was the original ODR. ODR's accuracy was 0%, which is an extremely low baseline; beating this alone is not sufficient to claim "SOTA". The authors mentioned other contemporary open-source DRAs in their related work section, such as DeepResearcher and W
