TensorOpera Router: A Multi-Model Router for Efficient LLM Inference

Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, and Chaoyang He

arXiv:2408.12320·cs.AI·October 25, 2024

TL;DR

TensorOpera Router (TO-Router) is a multi-model system that dynamically routes queries to different LLMs, improving efficiency and reducing costs while maintaining or improving performance.

Contribution

We introduce TO-Router, a system that integrates multiple LLMs and intelligently routes queries to optimize efficiency, cost, and performance.

Findings

01. Up to 40% improvement in query efficiency
02. Cost reductions of up to 30%
03. Performance maintained or improved by up to 10%

Abstract

With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each with domain-specific expertise. This proliferation has highlighted the need for fast, high-quality, and cost-effective LLM query response methods. Yet no single LLM efficiently balances this trilemma: some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that integrates various LLM experts into a single query interface and dynamically routes each incoming query to the highest-performing expert based on the query's requirements. Through extensive experiments, we demonstrate that, compared to standalone expert models, TO-Router improves query efficiency by up to 40% and leads to significant cost…
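The routing idea described in the abstract can be illustrated with a small sketch. This is not the TO-Router implementation: the expert names, cost units, quality scores, keyword-based domain classifier, and the "cheapest expert above a quality threshold" policy are all hypothetical stand-ins, chosen only to show how a single query interface might dispatch to different experts under a cost/quality trade-off.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    """A hypothetical LLM endpoint; every field is an illustrative assumption."""
    name: str
    cost_per_query: float  # made-up relative cost units
    quality: dict          # assumed per-domain quality scores in [0, 1]


def classify_domain(query: str) -> str:
    """Toy keyword classifier standing in for a learned routing model."""
    lowered = query.lower()
    if any(word in lowered for word in ("def ", "compile", "bug")):
        return "code"
    if any(word in lowered for word in ("integral", "prove", "equation")):
        return "math"
    return "general"


def route(query: str, experts: list, min_quality: float) -> Expert:
    """Pick the cheapest expert whose assumed quality meets the threshold.

    If no expert meets the threshold, fall back to the expert with the
    highest assumed quality for the query's domain.
    """
    domain = classify_domain(query)
    eligible = [e for e in experts if e.quality.get(domain, 0.0) >= min_quality]
    if eligible:
        return min(eligible, key=lambda e: e.cost_per_query)
    return max(experts, key=lambda e: e.quality.get(domain, 0.0))


# Hypothetical expert pool: names, costs, and scores are invented for the sketch.
EXPERTS = [
    Expert("small-general", 1.0, {"general": 0.80, "code": 0.50, "math": 0.40}),
    Expert("code-expert", 3.0, {"general": 0.60, "code": 0.90, "math": 0.55}),
    Expert("large-general", 10.0, {"general": 0.95, "code": 0.85, "math": 0.90}),
]

if __name__ == "__main__":
    for q in ("Fix this bug in my parser", "What is the capital of France?"):
        print(q, "->", route(q, EXPERTS, min_quality=0.8).name)
```

Under these invented scores, a coding query goes to the mid-priced code specialist rather than the most expensive model, while a general question stays on the cheapest one. A real router would presumably replace the keyword classifier with a learned predictor of per-expert quality.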

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Network Packet Processing and Optimization