OmniRouter: Budget and Performance Controllable Multi-LLM Routing
Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, Yongfeng Zhang

TL;DR
OmniRouter introduces a globally optimized multi-LLM routing framework that balances cost and performance, improving accuracy and reducing computational costs compared to existing methods.
Contribution
OmniRouter models LLM routing as a constrained optimization problem and combines a hybrid predictor with a Lagrangian dual optimizer to allocate resources globally rather than query by query.
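One common way to write such a constrained routing problem and its Lagrangian dual is sketched below. The symbols are assumptions introduced here for illustration, not the paper's notation: $x_{ij}$ indicates that query $i$ is routed to model $j$, $\hat{p}_{ij}$ and $\hat{c}_{ij}$ are the predicted correctness and cost, and $B$ is a global budget. The paper's exact objective and constraints may differ (for example, minimizing cost subject to an accuracy floor).

```latex
% Primal: maximize predicted quality under a global budget (assumed form)
\max_{x}\ \sum_{i=1}^{n}\sum_{j=1}^{m} \hat{p}_{ij}\, x_{ij}
\quad \text{s.t.} \quad
\sum_{j=1}^{m} x_{ij} = 1 \ \ \forall i, \qquad
\sum_{i=1}^{n}\sum_{j=1}^{m} \hat{c}_{ij}\, x_{ij} \le B, \qquad
x_{ij} \in \{0, 1\}.

% Lagrangian with multiplier \lambda \ge 0 on the budget constraint
L(x, \lambda) = \sum_{i,j} \hat{p}_{ij}\, x_{ij}
  - \lambda \Big( \sum_{i,j} \hat{c}_{ij}\, x_{ij} - B \Big).

% Dual: for fixed \lambda the maximization separates per query
\min_{\lambda \ge 0}\ g(\lambda), \qquad
g(\lambda) = \lambda B + \sum_{i=1}^{n} \max_{j}\,
  \big( \hat{p}_{ij} - \lambda\, \hat{c}_{ij} \big).
```

For a fixed $\lambda$, each query independently picks the model with the best cost-penalized score, and the single scalar $\lambda$ carries the global budget information across queries.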
Findings
Achieves up to 6.30% higher response accuracy.
Reduces computational costs by at least 10.15%.
Demonstrates effective global resource management in multi-LLM serving.
Abstract
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can handle simpler tasks efficiently with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable LLM from a pool of candidates to process each input, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically model this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually. This overlooks global budget constraints and results in ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing…
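The contrast between per-query greedy routing and budget-aware routing can be illustrated with a small Python sketch. It is not the OmniRouter implementation: the function names, the bisection search over the multiplier, and the toy scores and costs are all assumptions made here for illustration, and the predicted scores and costs are taken as given rather than produced by a learned predictor.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows = queries, columns = candidate models


def route(scores: Matrix, costs: Matrix, lam: float) -> List[int]:
    """Pick, per query, the model maximizing score - lam * cost.

    With lam = 0 this is the per-query greedy choice (best predicted score).
    """
    choice = []
    for s_row, c_row in zip(scores, costs):
        best = max(range(len(s_row)), key=lambda j: s_row[j] - lam * c_row[j])
        choice.append(best)
    return choice


def total(matrix: Matrix, choice: List[int]) -> float:
    """Sum the selected entry of each row."""
    return sum(row[j] for row, j in zip(matrix, choice))


def budgeted_route(scores: Matrix, costs: Matrix, budget: float,
                   iters: int = 50) -> Tuple[List[int], float]:
    """Bisection on the multiplier lam (hypothetical helper, not from the paper).

    Invariant: `hi` always yields an assignment within budget, so the
    returned assignment is feasible even though it may not be optimal.
    """
    greedy = route(scores, costs, 0.0)
    if total(costs, greedy) <= budget:
        return greedy, 0.0
    lo, hi = 0.0, 1.0
    while total(costs, route(scores, costs, hi)) > budget:
        hi *= 2.0
        if hi > 1e9:
            raise ValueError("budget cannot be met by any assignment")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if total(costs, route(scores, costs, mid)) > budget:
            lo = mid  # still over budget: penalize cost more
        else:
            hi = mid  # feasible: try a smaller penalty
    return route(scores, costs, hi), hi


# Toy data (made up): column 0 is a small model, column 1 a large one.
scores = [[0.90, 0.95], [0.40, 0.90], [0.80, 0.85]]
costs = [[1.0, 10.0], [1.0, 10.0], [1.0, 10.0]]

greedy_choice = route(scores, costs, 0.0)
budget_choice, lam = budgeted_route(scores, costs, budget=12.0)
print("greedy:", greedy_choice, "cost", total(costs, greedy_choice))
print("budgeted:", budget_choice, "cost", total(costs, budget_choice))
```

In this toy case the greedy router sends every query to the large model, while the budgeted router reserves the large model for the one query where it helps most. Because the assignment is discrete, a relaxation like this can leave part of the budget unused or miss the true optimum; it is meant only to show how a single global multiplier couples otherwise independent per-query decisions.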
Taxonomy
Topics: Distributed and Parallel Computing Systems · IoT and Edge/Fog Computing · Service-Oriented Architecture and Web Services
