HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools
Aashna Garg, Siddharth Singha Roy, Jinu Jang, Federico Brancasi, Shengyu Fu

TL;DR
HyDRA is a cost-aware routing framework for heterogeneous LLM pools. It predicts each query's capability requirements and matches them against model profiles defined in configuration, so the catalog can change without retraining the router, and routing remains language-invariant.
Contribution
HyDRA introduces a decoupled, multi-dimensional routing architecture that adapts to catalog changes and supports language-invariant routing, with demonstrated cost savings and maintained quality.
Findings
The deployed predictor runs at a median CPU inference latency of 86 ms in production.
Achieves up to 75.4% resolution accuracy at 12.9% cost savings.
Generalizes across multiple benchmarks and languages.
Abstract
Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog --…
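
The matching step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the model names, costs, capability scores, the `ModelProfile` and `route` names, and the `tolerance` parameter are all hypothetical. The abstract only states that the cheapest model meeting the predicted requirements is selected; the fallback to the smallest total shortfall when no model qualifies is an assumption added here so the function always returns a model.

```python
from dataclasses import dataclass

# The four capability dimensions named in the abstract.
DIMENSIONS = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass(frozen=True)
class ModelProfile:
    """Configuration-defined profile (hypothetical schema)."""
    name: str
    cost: float          # relative cost per request, illustrative units
    capabilities: dict   # dimension -> score in [0, 1]


def shortfall(requirements: dict, capabilities: dict) -> float:
    """Total amount by which a model falls short of the requirements."""
    return sum(
        max(0.0, requirements[d] - capabilities.get(d, 0.0))
        for d in DIMENSIONS
    )


def route(requirements: dict, catalog: list, tolerance: float = 0.0) -> ModelProfile:
    """Pick the cheapest model whose shortfall is within tolerance.

    Assumption: if no model qualifies, fall back to the model with the
    smallest shortfall, breaking ties by cost.
    """
    if not catalog:
        raise ValueError("catalog must contain at least one model")
    adequate = [
        m for m in catalog
        if shortfall(requirements, m.capabilities) <= tolerance
    ]
    if adequate:
        return min(adequate, key=lambda m: m.cost)
    return min(catalog, key=lambda m: (shortfall(requirements, m.capabilities), m.cost))


# Hypothetical catalog; editing this list requires no retraining because
# the predictor outputs requirements, not model identities.
CATALOG = [
    ModelProfile("small-model", 1.0,
                 {"reasoning": 0.4, "code_generation": 0.4, "debugging": 0.3, "tool_use": 0.2}),
    ModelProfile("medium-model", 5.0,
                 {"reasoning": 0.7, "code_generation": 0.7, "debugging": 0.6, "tool_use": 0.6}),
    ModelProfile("large-model", 20.0,
                 {"reasoning": 0.95, "code_generation": 0.95, "debugging": 0.9, "tool_use": 0.9}),
]

# Requirements as the K=4 sigmoid heads might emit them (made-up values).
easy_query = {"reasoning": 0.2, "code_generation": 0.3, "debugging": 0.1, "tool_use": 0.0}
hard_query = {"reasoning": 0.9, "code_generation": 0.8, "debugging": 0.7, "tool_use": 0.5}

print(route(easy_query, CATALOG).name)
print(route(hard_query, CATALOG).name)
```

Because the router only compares predicted requirement scores against profile scores, adding or removing a model in this sketch means editing `CATALOG` alone, which mirrors the decoupling from the model catalog that the abstract describes.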
