TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

Pei Yang; Wanyi Chen; Tongyun Yang; Pengbin Feng; Jiarong Xing; Wentao Guo; Yuhang Yao; Yuhang Han; Hanchen Li; Xu Wang; Zeyu Wang; Jie Xiao; Anjie Yang; Liang Tian; Lynn Ai; Eric Yang; Tianyu Shi

arXiv:2605.18859·cs.LG·May 20, 2026

TL;DR

TwinRouterBench is a step-level benchmark for LLM routing in agentic workloads, where the router picks a model at each agent step. Its static track scores routing decisions deterministically, without online LLM judges, which supports fast offline iteration; its dynamic track validates routers end to end in live runs.

Contribution

The paper introduces a step-level routing benchmark with two tracks: a static dataset of router-visible prefixes and a live evaluation harness, aimed at assessing LLM routing under realistic agentic conditions.

Findings

01. Provides deterministic scoring without online LLM judges.

02. Includes 970 static prefixes and 500 dynamic cases across multiple datasets.

03. Enables fast offline iteration and end-to-end validation.

Abstract

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic…
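
To make the idea of judge-free static scoring concrete, here is a minimal sketch of how a router's per-prefix tier choice could be compared against a stored target tier. The abstract is truncated before it describes the actual metric, so everything in this block is an assumption for illustration: the `PrefixRecord` schema, the integer tier encoding (0 = cheapest), the `score_router` function, the three reported rates, and the toy prefix IDs are all hypothetical and are not taken from the released dataset or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixRecord:
    """One static-track item (hypothetical schema, not the released format)."""
    prefix_id: str
    target_tier: int  # assumed encoding: 0 = cheapest tier, larger = more capable


def score_router(records, predictions):
    """Compare predicted tiers against target tiers with plain arithmetic.

    `predictions` maps prefix_id -> predicted tier. No model is called here,
    so repeated runs on the same inputs give the same numbers.
    """
    if not records:
        raise ValueError("no records to score")
    exact = under = over = 0
    for rec in records:
        pred = predictions[rec.prefix_id]
        if pred == rec.target_tier:
            exact += 1
        elif pred < rec.target_tier:
            under += 1  # cheaper than the target: may break the downstream task
        else:
            over += 1   # pricier than the target: cost spent unnecessarily
    n = len(records)
    return {"exact": exact / n, "under": under / n, "over": over / n}


# Toy data, invented for this sketch only.
records = [
    PrefixRecord("swe-001-step3", 2),
    PrefixRecord("bfcl-014-step1", 0),
    PrefixRecord("qmsum-007-step2", 1),
    PrefixRecord("mtrag-021-step4", 1),
]
predictions = {
    "swe-001-step3": 1,
    "bfcl-014-step1": 0,
    "qmsum-007-step2": 2,
    "mtrag-021-step4": 1,
}
result = score_router(records, predictions)
print(result)
```

Separating under-provisioning from over-provisioning is a design choice of this sketch rather than something the excerpt states: the two errors have different costs in a routing setting, since one risks task failure and the other only wastes budget.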

Code & Models

Repositories

CommonstackAI/TwinRouterBench
github

Datasets

Amorph/TwinRouterBench
dataset · 119 dl