HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

Aashna Garg; Siddharth Singha Roy; Jinu Jang; Federico Brancasi; Shengyu Fu

arXiv:2605.17106 · cs.CL · May 19, 2026

TL;DR

HyDRA is a routing framework for heterogeneous LLM pools that predicts each query's capability requirements and matches them against configurable model profiles, so the model catalog can change without retraining the router. It targets lower cost at maintained quality, along with language-invariant routing.

Contribution

HyDRA introduces a decoupled, multi-dimensional routing architecture that adapts to catalog changes and supports language-invariant routing, with demonstrated cost savings and maintained quality.

Findings

01. Median CPU inference latency of the deployed predictor is 86 ms in production.

02. Achieves up to 75.4% resolution accuracy at 12.9% cost savings.

03. Generalizes across multiple benchmarks and languages.

Abstract

Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog --…
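
The matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact shortfall formula, so everything concrete here is an assumption made for illustration: the shortfall is taken as the summed per-dimension deficit max(0, requirement - capability), and the `ModelProfile` class, the `tolerance` parameter, the fallback rule, and the catalog entries with their costs and scores are all hypothetical. The requirement scores stand in for the outputs of the four sigmoid heads.

```python
from dataclasses import dataclass

# The four capability dimensions named in the abstract.
DIMENSIONS = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical configuration-defined profile; not the paper's schema."""
    name: str
    cost: float          # assumed relative cost per request
    capabilities: dict   # dimension -> score in [0, 1]


def shortfall(requirements: dict, profile: ModelProfile) -> float:
    """Assumed definition: total amount by which the model falls short."""
    return sum(
        max(0.0, requirements[d] - profile.capabilities.get(d, 0.0))
        for d in DIMENSIONS
    )


def route(requirements: dict, catalog: list, tolerance: float = 0.0) -> ModelProfile:
    """Pick the cheapest model whose shortfall is within the tolerance.

    If no model qualifies, fall back to the smallest shortfall, breaking
    ties by cost (the fallback rule is an assumption of this sketch).
    """
    if not catalog:
        raise ValueError("catalog must contain at least one model profile")
    adequate = [m for m in catalog if shortfall(requirements, m) <= tolerance]
    if adequate:
        return min(adequate, key=lambda m: m.cost)
    return min(catalog, key=lambda m: (shortfall(requirements, m), m.cost))


# Illustrative catalog; names, costs, and scores are invented.
CATALOG = [
    ModelProfile("small", 1.0, {"reasoning": 0.4, "code_generation": 0.5,
                                "debugging": 0.3, "tool_use": 0.2}),
    ModelProfile("medium", 4.0, {"reasoning": 0.7, "code_generation": 0.8,
                                 "debugging": 0.7, "tool_use": 0.6}),
    ModelProfile("large", 20.0, {"reasoning": 0.95, "code_generation": 0.95,
                                 "debugging": 0.9, "tool_use": 0.9}),
]

if __name__ == "__main__":
    easy = {"reasoning": 0.2, "code_generation": 0.3, "debugging": 0.1, "tool_use": 0.0}
    hard = {"reasoning": 0.6, "code_generation": 0.75, "debugging": 0.7, "tool_use": 0.1}
    print(route(easy, CATALOG).name)
    print(route(hard, CATALOG).name)
```

Because the router only reads profiles from the catalog, adding or removing a model in this sketch means editing `CATALOG`, with no change to the requirement predictor; that mirrors the decoupling the abstract describes, though the real system's profile format may differ.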
