TensorOpera Router: A Multi-Model Router for Efficient LLM Inference
Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, and Chaoyang He

TL;DR
TensorOpera Router (TO-Router) is a multi-model system that dynamically routes queries to different LLMs, improving efficiency and reducing costs while maintaining or improving performance.
Contribution
We introduce TO-Router, a system that integrates multiple LLMs and intelligently routes queries to optimize efficiency, cost, and performance.
Findings
Up to 40% improvement in query efficiency
Cost reductions of up to 30%
Performance maintained or improved by up to 10%
Abstract
With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet no single LLM efficiently balances this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes each incoming query to the highest-performing expert based on the query's requirements. Through extensive experiments, we demonstrate that when compared to standalone expert models, TO-Router improves query efficiency by up to 40%, and leads to significant cost…
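To make the routing idea in the abstract concrete, here is a minimal sketch in Python. It is not TO-Router's actual method: the `Expert` class, the expert names, the cost figures, the keyword-overlap quality estimate, and the `cost_weight` parameter are all hypothetical and chosen only for illustration. A real router would likely replace `predicted_quality` with a learned predictor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    name: str               # hypothetical model identifier
    cost_per_query: float   # illustrative cost units, not real pricing
    keywords: frozenset     # toy stand-in for domain expertise


def predicted_quality(expert: Expert, query: str) -> float:
    """Toy quality estimate: fraction of query words found in the expert's keywords."""
    words = query.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in expert.keywords)
    return hits / len(words)


def route(query: str, experts: list, cost_weight: float = 0.1) -> Expert:
    """Return the expert maximizing estimated quality minus weighted cost."""
    return max(
        experts,
        key=lambda e: predicted_quality(e, query) - cost_weight * e.cost_per_query,
    )


# Hypothetical expert pool (assumption: two costly specialists, one cheap generalist).
EXPERTS = [
    Expert("code-expert", 1.0, frozenset({"python", "function", "bug", "compile"})),
    Expert("bio-expert", 1.0, frozenset({"protein", "gene", "cell"})),
    Expert("general-small", 0.2, frozenset()),
]

if __name__ == "__main__":
    for q in ["fix this python bug", "which gene encodes this protein", "hello there"]:
        print(q, "->", route(q, EXPERTS).name)
```

In this sketch, queries that match a specialist's keywords go to that specialist, while queries with no match fall back to the cheapest model. Raising `cost_weight` shifts the trade-off toward cheaper experts.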
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Network Packet Processing and Optimization
