Splitwise: Efficient generative LLM inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini

TL;DR
Splitwise splits LLM inference into its two phases, prompt computation and token generation, and runs each on separate machines. Matching hardware to each phase improves throughput, cost, and power efficiency.
Contribution
The paper proposes phase splitting for LLM inference: the compute-intensive prompt phase and the memory-intensive token generation phase run on separate machines, so hardware and resource provisioning can be chosen for each phase independently.
Findings
Achieves 1.4x higher throughput at 20% lower cost than current designs.
Alternatively, attains 2.35x more throughput with the same cost and power budgets.
Uses hardware suited to each inference phase.
Abstract
Recent innovations in generative large language models (LLMs) have made their applications and use-cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation, and a memory-intensive token generation, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike compute-intensive prompt computation phases, token generation phases do not require the compute capability of the latest GPUs, and can be run with lower power and cost.…
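The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names Request, PromptMachine, TokenMachine, toy_next_token, and serve_split are hypothetical, the "model" is a toy arithmetic function, and the sketch assumes that the state built during the prompt phase (called kv_cache here) is handed to the second machine along with the request.

```python
from dataclasses import dataclass, field
from typing import List


def toy_next_token(context: List[int]) -> int:
    """Stand-in for a model forward pass (hypothetical, deterministic)."""
    return (sum(context) + len(context)) % 50


@dataclass
class Request:
    prompt_tokens: List[int]
    max_new_tokens: int
    kv_cache: List[int] = field(default_factory=list)  # assumed hand-over state
    output: List[int] = field(default_factory=list)
    handled_by: List[str] = field(default_factory=list)


class PromptMachine:
    """Hypothetical machine that runs only the prompt computation phase."""

    def run(self, req: Request) -> Request:
        # All prompt tokens are processed in a single pass (compute-intensive).
        req.kv_cache = list(req.prompt_tokens)
        first = toy_next_token(req.kv_cache)
        req.output.append(first)
        req.kv_cache.append(first)
        req.handled_by.append("prompt")
        return req


class TokenMachine:
    """Hypothetical machine that runs only the token generation phase."""

    def run(self, req: Request) -> Request:
        # One token per step; each step rereads the cached state (memory-intensive).
        while len(req.output) < req.max_new_tokens:
            tok = toy_next_token(req.kv_cache)
            req.output.append(tok)
            req.kv_cache.append(tok)
        req.handled_by.append("token")
        return req


def serve_split(req: Request, prompt_m: PromptMachine, token_m: TokenMachine) -> Request:
    """Run the prompt phase on one machine, then hand the request to another."""
    req = prompt_m.run(req)
    return token_m.run(req)


def serve_single(req: Request) -> Request:
    """Baseline: both phases on the same machine, for comparison."""
    machine_state = list(req.prompt_tokens)
    while len(req.output) < req.max_new_tokens:
        tok = toy_next_token(machine_state)
        req.output.append(tok)
        machine_state.append(tok)
    return req
```

In this toy setting, serve_split and serve_single produce the same tokens for the same request; only the placement of the work differs. That placement is what would let each phase run on hardware chosen for its own compute, memory, and power profile.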
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science
