Performance Characterization of Expert Router for Scalable LLM Inference
Josef Pichlmeier, Philipp Ross, Andre Luckow

TL;DR
This paper evaluates Expert Router, a scalable routing system for deploying specialized LLMs efficiently, demonstrating minimal latency overhead and stable performance across various configurations and user loads.
Contribution
It introduces and characterizes Expert Router, a novel modular routing architecture for scalable LLM inference with diverse expert model configurations.
Findings
High-parameter experts maintain stable throughput at moderate concurrency.
Smaller experts outperform tensor-parallel models at high concurrency.
Expert Router adds minimal latency overhead.
Abstract
Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains due to their versatility and utility for diverse tasks. Nevertheless, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge, primarily because of LLMs' high computational and memory demands. Specialized models optimized for specific tasks can be combined through a routing mechanism to address these challenges, creating a modular inference system. This paper introduces Expert Router, a scalable routing architecture that directs prompts to specialized expert models. We characterize multiple Expert Router configurations, including different Llama 3 models with quantized and non-quantized weights, under loads of up to 1,000 concurrent users. Our findings reveal that Expert Router introduces minimal latency overhead, with the configuration…
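The routing idea described in the abstract (classify an incoming prompt, then forward it to a specialized expert) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the keyword-overlap classifier, the `Expert` and `Router` classes, the domain names, and the echo handlers that stand in for served models. The abstract does not specify how Expert Router classifies prompts or calls its experts, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class Expert:
    """Hypothetical wrapper around one specialized model endpoint."""
    name: str
    handler: Callable[[str], str]  # stand-in for a call to a served model


class Router:
    """Toy router: picks the domain whose keywords overlap the prompt most."""

    def __init__(self, experts: Dict[str, Expert],
                 keywords: Dict[str, Set[str]], default: str) -> None:
        self.experts = experts
        self.keywords = keywords
        self.default = default  # fallback domain when no keyword matches

    def classify(self, prompt: str) -> str:
        words = set(prompt.lower().split())
        best, best_hits = self.default, 0
        for domain, kws in self.keywords.items():
            hits = len(words & kws)
            if hits > best_hits:  # strict '>' keeps the first domain on ties
                best, best_hits = domain, hits
        return best

    def route(self, prompt: str) -> str:
        domain = self.classify(prompt)
        return self.experts[domain].handler(prompt)


def build_demo_router() -> Router:
    """Builds a router with three illustrative (assumed) domains."""
    def echo(name: str) -> Callable[[str], str]:
        # Echo handler: a real deployment would call an inference server here.
        return lambda prompt: f"[{name}] {prompt}"

    experts = {
        "code": Expert("code-expert", echo("code-expert")),
        "math": Expert("math-expert", echo("math-expert")),
        "general": Expert("general-expert", echo("general-expert")),
    }
    keywords = {
        "code": {"python", "bug", "function", "compile"},
        "math": {"integral", "solve", "equation", "prove"},
    }
    return Router(experts, keywords, default="general")


if __name__ == "__main__":
    router = build_demo_router()
    print(router.route("Fix this python bug"))
```

In this sketch the classifier is a cheap set intersection, so the routing step adds little work per request. A production router would more plausibly use a learned classifier and asynchronous calls to the expert backends.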
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Data Quality and Management
Methods: LLaMA
