DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

Young-Suk Lee; Ramon Fernandez Astudillo; Radu Florian

arXiv:2604.09251·cs.AI·April 24, 2026

TL;DR

DRBENCHER is a synthetic benchmark generator that tests whether research agents can combine web browsing with multi-step computation across diverse domains, a pairing that existing benchmarks evaluate only in isolation.

Contribution

The paper introduces a unified, answer-first pipeline that enforces four criteria (verifiability, complexity, difficulty, and diversity) across five domains, and uses the generated questions to expose the limits of current models.
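As a rough illustration of what "answer-first" could mean, the sketch below computes a gold answer by running parameterized code over knowledge-graph values (as the abstract describes) before any question text is rendered. The toy graph, the entity `ExampleCorp`, the `Template` class, and the `generate` helper are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Toy stand-in for a knowledge graph: entity -> property -> value.
# Every entity and number here is invented for illustration.
TOY_KG: Dict[str, Dict[str, float]] = {
    "ExampleCorp": {"revenue_2020": 200.0, "revenue_2021": 250.0},
}


@dataclass
class Template:
    """A question pattern paired with the code that computes its answer."""
    question: str
    compute: Callable[[Dict[str, float]], float]


def growth_pct(props: Dict[str, float]) -> float:
    """Year-over-year revenue growth, in percent."""
    old, new = props["revenue_2020"], props["revenue_2021"]
    return (new - old) / old * 100.0


def generate(entity: str, template: Template,
             kg: Dict[str, Dict[str, float]] = TOY_KG) -> dict:
    """Compute the gold answer first, then render the question text."""
    props = kg[entity]
    gold = template.compute(props)  # answer comes from executed code
    return {"question": template.question.format(entity=entity), "gold": gold}


template = Template(
    question="By what percentage did {entity}'s revenue grow from 2020 to 2021?",
    compute=growth_pct,
)
item = generate("ExampleCorp", template)
print(item["question"], "->", item["gold"])
```

Because the answer is produced by code rather than written by hand, it can be recomputed whenever the underlying values change, which is one plausible reading of the verifiability criterion.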

Findings

1. 76% human-validated answer correctness, excluding stale data
2. Strongest model achieves only 20% answer accuracy
3. DRBENCHER surpasses existing benchmarks in semantic diversity

Abstract

Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows…
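The diversity criterion mentions a greedy max-min embedding filter. A minimal sketch of that general technique (farthest-point selection) follows, assuming cosine distance and an arbitrary seed at index 0. Neither choice is stated in the abstract, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def greedy_max_min(embeddings: List[Sequence[float]], k: int) -> List[int]:
    """Pick up to k indices, each time taking the item whose distance
    to its nearest already-selected item is largest."""
    if k <= 0 or not embeddings:
        return []
    selected = [0]  # assumption: seed with the first item
    # min_dist[i] = distance from item i to its closest selected item
    min_dist = [cosine_distance(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(k, len(embeddings)):
        nxt = max(range(len(embeddings)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], cosine_distance(e, embeddings[nxt]))
    return selected


# Toy 2-D "embeddings": items 0 and 1 are near-duplicates.
points = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
selected = greedy_max_min(points, 3)
print(selected)  # the near-duplicate (index 1) is the one left out
```

In this toy run the near-duplicate question is skipped in favor of items that lie far from everything already chosen, which is the behavior a coverage-maximizing filter is meant to produce.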
