Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing
Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan

TL;DR
This paper introduces an intelligent, workload-aware routing system for LLM inference that significantly reduces response latency by considering workload phases and leveraging reinforcement learning.
Contribution
It proposes a novel heuristic-guided reinforcement learning router that improves load balancing and end-to-end latency for LLM inference workloads.
Findings
Over 11% latency reduction on public datasets
7.8% latency reduction on real workload data
Framework serves as a benchmark for LLM inference schedulers
Abstract
Large Language Model (LLM) workloads have distinct prefill and decode phases with different compute and memory requirements, which should ideally be accounted for when scheduling input queries across different LLM instances in a cluster. However, existing scheduling algorithms treat LLM workloads as monolithic jobs without considering the distinct characteristics of the two phases in each workload. This leads to sub-optimal scheduling and increased response latency. In this work, we start by characterizing factors affecting the response latency during LLM inference serving. We establish that better load balancing of inference requests across the available LLM instances can improve the end-to-end latency to a larger extent than merely focusing on optimizing the instance-level scheduler. Motivated by our findings, we propose a heuristic-guided reinforcement learning-based intelligent router…
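To make the idea of phase-aware load balancing concrete, here is a minimal sketch of a greedy router that tracks outstanding prefill and decode work separately for each instance. It is not the paper's reinforcement learning router. The `Instance` class, the `route` function, the two weight constants, and the assumption that an expected output length is available at routing time are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical view of one LLM instance's outstanding work."""
    name: str
    pending_prefill_tokens: int = 0  # prompt tokens not yet processed
    pending_decode_tokens: int = 0   # output tokens still expected


# Assumed relative per-token costs; illustrative values, not measurements.
PREFILL_WEIGHT = 1.0
DECODE_WEIGHT = 4.0


def estimated_load(inst: Instance) -> float:
    """Weighted sum of outstanding prefill and decode tokens."""
    return (PREFILL_WEIGHT * inst.pending_prefill_tokens
            + DECODE_WEIGHT * inst.pending_decode_tokens)


def route(instances, prompt_tokens, expected_output_tokens):
    """Send the request to the instance with the lowest estimated load."""
    target = min(instances, key=estimated_load)
    target.pending_prefill_tokens += prompt_tokens
    target.pending_decode_tokens += expected_output_tokens
    return target


instances = [Instance("a"), Instance("b")]
first = route(instances, prompt_tokens=1000, expected_output_tokens=50)
second = route(instances, prompt_tokens=100, expected_output_tokens=500)
third = route(instances, prompt_tokens=200, expected_output_tokens=20)
print(first.name, second.name, third.name)
```

After the first two requests, each instance holds exactly one request, so a router that only counts requests would see a tie. The phase-aware estimate instead sends the third request to the instance holding the prefill-heavy request, because under the assumed weights the decode-heavy request represents more outstanding work.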
Taxonomy
Topics: Software-Defined Networks and 5G · Distributed and Parallel Computing Systems · Mobile Agent-Based Network Management
