DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, Ao Qu

TL;DR
DecisionBench is a benchmark substrate for evaluating emergent delegation in long-horizon agentic workflows. It fixes the task suite, peer-model pool, delegation interface, and metric suite so that routing and orchestration strategies can be assessed on quality, cost, latency, and routing behavior.
Contribution
It introduces an open-source benchmark substrate, with a multi-axis metric suite and evaluation protocols, for emergent delegation in long-horizon agentic workflows.
Findings
Mean end-task quality is consistent across awareness conditions.
Routing fidelity varies significantly with the delivery channel.
Substantial headroom remains between observed delegation and perfect delegation.
Abstract
We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is…
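The abstract names delegation rate and routing fidelity-at-k among its metrics but, as excerpted here, does not define them. The following is a minimal sketch under assumed definitions, not the paper's implementation: delegation rate is taken to be the fraction of decision points at which the orchestrator delegates to a peer, and fidelity-at-k the fraction of delegations whose chosen peer falls within the top k of some reference ranking for that step. The `Step` record, its field names, and the peer names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Step:
    """One orchestrator decision point (hypothetical record layout)."""
    delegated_to: Optional[str]   # peer chosen via call_model, or None if handled locally
    ranked_peers: Sequence[str]   # assumed reference ranking for this step, best first


def delegation_rate(steps: Sequence[Step]) -> float:
    """Fraction of steps where the orchestrator delegated (assumed definition)."""
    if not steps:
        return 0.0
    return sum(s.delegated_to is not None for s in steps) / len(steps)


def fidelity_at_k(steps: Sequence[Step], k: int) -> float:
    """Fraction of delegations whose chosen peer is in the top-k reference peers
    (assumed definition; the paper's exact formula may differ)."""
    delegated = [s for s in steps if s.delegated_to is not None]
    if not delegated:
        return 0.0
    hits = sum(s.delegated_to in s.ranked_peers[:k] for s in delegated)
    return hits / len(delegated)


# Toy trace with made-up peer names.
steps = [
    Step("model_a", ["model_a", "model_b", "model_c"]),
    Step("model_c", ["model_a", "model_b", "model_c"]),
    Step(None, ["model_b", "model_a", "model_c"]),
    Step("model_b", ["model_b", "model_a", "model_c"]),
]

rate = delegation_rate(steps)
fid1 = fidelity_at_k(steps, 1)
fid3 = fidelity_at_k(steps, 3)
```

On this toy trace, three of four steps delegate, two of those three pick the top-ranked peer, and all three pick a peer within the top three. Conditioning fidelity on delegated steps only is a design choice of this sketch; it keeps the metric from being conflated with delegation rate.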
