Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

Jelena Markovic-Voronov; Kayhan Behdin; Yuanda Xu; Zhengze Zhou; Zhipeng Wang; Rahul Mazumder

arXiv:2603.26796 · cs.LG · March 31, 2026

TL;DR

This paper introduces a batch-level, resource-aware framework for routing queries to large language models. It jointly optimizes model assignment for each batch under cost and model-capacity limits, and adds a robust variant that accounts for uncertainty in predicted model performance.

Contribution

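The paper proposes a batch-level routing method that accounts for uncertainty in predicted LLM performance and respects cost and capacity limits, together with an offline instance allocation procedure. In the reported experiments it outperforms per-query routing, most clearly under adversarial batching.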

Findings

01. Robust routing improves accuracy by 1-14% over non-robust counterparts, depending on the performance estimator.

02. Batch-level routing outperforms per-query routing by up to 24% under adversarial batching.

03. Optimized instance allocation adds gains of up to 3% over non-optimized allocation.

Abstract

We study the problem of routing queries to large language models (LLMs) under cost, GPU resources, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty in predicted LLM performance, along with an offline instance allocation procedure that balances quality and throughput across multiple models. Experiments on two multi-task LLM benchmarks show that robustness improves accuracy by 1-14% over non-robust counterparts (depending on the performance estimator), batch-level routing outperforms per-query methods by up to 24% under adversarial batching,…
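
To make the batch-level idea concrete, here is a minimal sketch of routing one batch under a total cost budget and per-model capacity limits. It is not the paper's algorithm: the function name `route_batch`, the inputs `quality`, `cost`, `uncertainty`, and the penalty weight `kappa` are all hypothetical, the robust variant is approximated by subtracting an assumed uncertainty penalty from each predicted score, and exhaustive search stands in for whatever solver a real system would use.

```python
from itertools import product


def route_batch(quality, cost, budget, capacity, uncertainty=None, kappa=0.0):
    """Pick one model per query to maximize total (penalized) predicted quality.

    quality[i][m]     : predicted quality of model m on query i (assumed estimator output)
    cost[i][m]        : predicted cost of sending query i to model m
    budget            : maximum total cost allowed for the whole batch
    capacity[m]       : maximum number of queries model m may take in this batch
    uncertainty[i][m] : optional error bar on quality[i][m] (assumption)
    kappa             : penalty weight; 0 gives the non-robust router
    """
    n_queries, n_models = len(quality), len(capacity)
    best_score, best_assignment = float("-inf"), None
    # Enumerate every assignment; only viable for tiny batches (n_models ** n_queries).
    for assignment in product(range(n_models), repeat=n_queries):
        # Batch-level cost constraint.
        if sum(cost[i][m] for i, m in enumerate(assignment)) > budget:
            continue
        # Per-model capacity constraint.
        if any(assignment.count(m) > capacity[m] for m in range(n_models)):
            continue
        score = 0.0
        for i, m in enumerate(assignment):
            penalty = kappa * uncertainty[i][m] if uncertainty is not None else 0.0
            score += quality[i][m] - penalty
        if score > best_score:
            best_score, best_assignment = score, assignment
    # Returns (None, -inf) when no assignment satisfies the constraints.
    return best_assignment, best_score


# Toy batch: 3 queries, model 0 is small and cheap, model 1 is large and costly.
quality = [[0.6, 0.9], [0.5, 0.95], [0.7, 0.75]]
cost = [[1, 4], [1, 4], [1, 4]]
uncertainty = [[0.05, 0.05], [0.05, 0.5], [0.05, 0.05]]
budget, capacity = 6, [3, 1]

plain, _ = route_batch(quality, cost, budget, capacity)
robust, _ = route_batch(quality, cost, budget, capacity, uncertainty, kappa=1.0)
print(plain)   # (0, 1, 0): the single large-model slot goes to query 1
print(robust)  # (1, 0, 0): query 1's large-model score is too uncertain
```

In this toy batch the budget and the capacity of the large model allow only one query to be upgraded. The non-robust router spends that slot on the query with the highest predicted gain, while the penalized version moves it to a query whose prediction is more reliable. Because the search is exponential in batch size, a practical implementation would presumably hand the same objective and constraints to an integer programming solver.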
