Improving and Evaluating Open Deep Research Agents

Doaa Allabadi, Kyle Bradbury, and Jordan M. Malof

arXiv:2508.10152·cs.AI·January 9, 2026

3 Reviews

TL;DR

This paper introduces ODR+, an improved version of the open-source Open Deep Research (ODR) agent, which achieves a 10% success rate on BC-Small, a new benchmark for evaluating autonomous internet-based research systems.

Contribution

The paper adapts the BrowseComp benchmark into a smaller subset (BC-Small) for evaluating deep research agents, proposes strategic enhancements to ODR, and demonstrates improved performance over existing proprietary and open-source systems.

Findings

1. ODR+ achieves 10% success rate on BC-Small benchmark.

2. All tested systems initially scored 0% on the benchmark.

3. Three strategic improvements significantly enhanced ODR's performance.

Abstract

We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we found only one open-source DRA, termed Open Deep Research (ODR). In this work, we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 2

Strengths

1. The paper proposes three key strategies to improve the Open Deep Research (ODR) system.
2. The paper will release the code for ODR+ to support continued progress.

Weaknesses

1. Details are not provided on how this smaller subset of BrowseComp-Small is selected. The original BrowseComp contains information such as distribution of topics of the test set, human performance on questions etc. Such details are missing for BrowseComp-Small which makes it hard to understand which skills are being tested by BrowseComp-Small and what it means to have 0% accuracy on this set.
2. The paper argues that BrowseComp-Small is a smaller and more practical subset of BrowseComp but co

Reviewer 02·Rating 2·Confidence 4

Strengths

1. Practical contribution: introduces a smaller, more accessible subset (BC-Small) for BrowseComp-style evaluation.
2. Ablation sanity checks: removing any major component degrades performance, demonstrating each stage matters.
3. Focused study on an important emerging topic (LLM-based deep research agents) that the community cares about.

Weaknesses

1. Statistical rigor is weak: 6/60 correct (10%) with no confidence intervals, variance estimates, or bootstrap analysis makes the result fragile.
2. Baseline comparison fairness concerns: proprietary systems evaluated without structured-output wrappers and without similar tuning; zero-shot vs trained split mismatch. I also did not understand what "tuning" meant for their ODR+ system.
3. Ablations too limited: only on a 20-question subset; no partial ablations or sensitivity studies (e.g.,
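
Reviewer 02's first point about missing confidence intervals can be made concrete. The following is a minimal sketch, not taken from the paper or the review, that computes a Wilson score interval for the reported 6 correct answers out of 60. The function name `wilson_interval` and the fixed `z = 1.96` (an approximate 95% level) are illustrative assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 is an assumed critical value for an approximate 95% interval.
    The function name and signature are hypothetical, for illustration only.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    # Center of the interval is pulled toward 0.5 relative to p_hat.
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials ** 2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 6 correct out of 60 questions, as quoted in the review.
low, high = wilson_interval(6, 60)
print(f"6/60 correct: approx. 95% Wilson interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval spans roughly 5% to 20%, which illustrates why a 10% point estimate on 60 questions is hard to interpret without an uncertainty estimate.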

Reviewer 03·Rating 6·Confidence 2

Strengths

1. The paper correctly identifies and addresses the core problem hindering open-source DRA development: the lack of an available baseline and benchmark.
2. The paper does an excellent job in reproducibility, providing pseudocode (Algorithm 1), prompt summaries (Table 1), evaluation methodology, and hyperparameters, which is critical for subsequent research.
3. The paper is well-organized and clearly written. The authors clearly articulate a complex problem and present a logically sound solu

Weaknesses

1. The paper claims that ODR+ "achieves the current state-of-the-art (SOTA) performance on the BrowseComp benchmark among open-source models". This is a very misleading claim because the only open-source model it was experimentally compared against was the original ODR. ODR's accuracy was 0%, which is an extremely low baseline; beating this alone is not sufficient to claim "SOTA". The authors mentioned other contemporary open-source DRAs in their related work section, such as DeepResearcher and W