DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Yuxuan Gao; Megan Wang; Yi Ling Yu; Zijian Carl Ma; Ao Qu

arXiv:2605.19099 · cs.AI · May 20, 2026

TL;DR

DecisionBench is a benchmark for evaluating emergent delegation in long-horizon agentic workflows. It fixes the task suite, peer-model pool, delegation interface, and metric suite, so that different routing and orchestration strategies can be compared on quality, cost, latency, and routing behavior.

Contribution

It introduces an open-source benchmark substrate for emergent delegation in long-horizon workflows, with a multi-axis metric suite and fixed evaluation protocols.

Findings

01

Mean end-task quality is consistent across awareness conditions.

02

Routing fidelity varies significantly with delivery channel.

03

A substantial gap remains between observed delegation and the counterfactual ceiling of perfect delegation.

Abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is…
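
To make two of the metrics named in the abstract concrete, here is a minimal Python sketch of how delegation rate and routing fidelity-at-k might be computed from logged episodes. The definitions are assumptions, not the paper's: delegation rate is taken as the share of task instances in which the agent delegated at all, and fidelity-at-k as the share of delegated instances whose chosen peer falls in the top k of a reference ranking. The `Episode` schema and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Episode:
    """One task instance, reduced to the fields these metrics need.

    Hypothetical schema; the paper's actual log format is not shown.
    """
    delegated_to: Optional[str]      # peer chosen via call_model, or None
    peer_ranking: Tuple[str, ...]    # assumed reference ranking, best first


def delegation_rate(episodes: Sequence[Episode]) -> float:
    """Assumed definition: fraction of episodes with any delegation."""
    if not episodes:
        return 0.0
    delegated = sum(e.delegated_to is not None for e in episodes)
    return delegated / len(episodes)


def routing_fidelity_at_k(episodes: Sequence[Episode], k: int) -> float:
    """Assumed definition: among delegated episodes, the fraction whose
    chosen peer is within the top-k entries of the reference ranking."""
    delegated = [e for e in episodes if e.delegated_to is not None]
    if not delegated:
        return 0.0
    hits = sum(e.delegated_to in e.peer_ranking[:k] for e in delegated)
    return hits / len(delegated)
```

Restricting fidelity-at-k to delegated episodes keeps it separate from delegation rate, so a model that rarely delegates but chooses well is not penalized twice. The paper may normalize differently.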
