Performance Characterization of Expert Router for Scalable LLM Inference

Josef Pichlmeier; Philipp Ross; Andre Luckow

arXiv:2404.15153·cs.CL·October 10, 2024

Open Access

TL;DR

This paper evaluates Expert Router, a scalable routing system for deploying specialized LLMs efficiently, demonstrating minimal latency overhead and stable performance across various configurations and user loads.

Contribution

It introduces and characterizes Expert Router, a novel modular routing architecture for scalable LLM inference with diverse expert model configurations.
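The routing idea can be illustrated with a minimal sketch. This page does not describe how Expert Router actually classifies prompts, so the keyword-overlap rule below is a placeholder assumption, not the authors' method. The `Expert` class, `route` function, `make_stub` helper, and the expert names are all hypothetical; a real deployment would replace the stub generators with calls to served models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Expert:
    """A hypothetical expert: a name, topic keywords, and a generate callable."""
    name: str
    keywords: frozenset
    generate: Callable[[str], str]


def make_stub(name: str) -> Callable[[str], str]:
    # Stand-in for a request to a served expert model (assumption).
    return lambda prompt: f"[{name}] {prompt}"


def route(prompt: str, experts: List[Expert], fallback: Expert) -> Expert:
    """Pick the expert whose keywords overlap most with the prompt.

    Keyword overlap is a placeholder for whatever classifier a real
    router would use; the fallback handles prompts with no match.
    """
    words = set(prompt.lower().split())
    best = max(experts, key=lambda e: len(words & e.keywords))
    return best if words & best.keywords else fallback


EXPERTS = [
    Expert("code", frozenset({"python", "function", "bug"}), make_stub("code")),
    Expert("finance", frozenset({"stock", "loan", "interest"}), make_stub("finance")),
]
GENERAL = Expert("general", frozenset(), make_stub("general"))

if __name__ == "__main__":
    chosen = route("Fix this Python bug", EXPERTS, GENERAL)
    print(chosen.name, "->", chosen.generate("Fix this Python bug"))
```

Keeping the routing decision separate from the experts themselves is what makes the system modular: experts can be added, removed, or swapped for quantized variants without touching the routing logic.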

Findings

01. High-parameter experts maintain stable throughput at moderate concurrency.

02. Smaller experts outperform tensor-parallel models at high concurrency.

03. Expert Router adds minimal latency overhead.

Abstract

Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains due to their versatility and utility for diverse tasks. Nevertheless, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge, primarily because of LLMs' high computational and memory demands. Specialized models optimized for specific tasks can be combined through a routing mechanism to address these challenges, creating a modular inference system. This paper introduces Expert Router, a scalable routing architecture that directs prompts to specialized expert models. We characterize multiple Expert Router configurations, including different Llama 3 models with quantized and non-quantized weights, under up to 1,000 concurrent users. Our findings reveal that Expert Router introduces minimal latency overhead, with the configuration…

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Data Quality and Management

Methods: LLaMA