Splitwise: Efficient generative LLM inference using phase splitting

Pratyush Patel; Esha Choukse; Chaojie Zhang; Aashaka Shah; Íñigo Goiri; Saeed Maleki; Ricardo Bianchini

arXiv:2311.18677·cs.AR·May 21, 2024·6 cites

TL;DR

Splitwise introduces a phase-splitting approach for LLM inference, separating prompt computation and token generation onto different machines, leading to significant improvements in throughput, cost, and power efficiency.

Contribution

The paper proposes a novel phase splitting technique for LLM inference, optimizing hardware utilization and resource provisioning for different inference phases.

Findings

01 Achieves 1.4x higher throughput at 20% lower cost than current designs.

02 Alternatively, attains 2.35x more throughput with the same cost and power budgets.

03 Matches each inference phase to the hardware best suited to it.
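The first two findings describe different design points, so it can help to put them on one scale. The sketch below normalizes each to throughput per unit cost against a baseline of 1.0. This derived metric and the function name `throughput_per_cost` are illustrative assumptions; the paper reports only the raw figures quoted above.

```python
def throughput_per_cost(throughput_gain: float, cost_ratio: float) -> float:
    """Relative throughput per unit cost, versus a baseline of 1.0.

    Hypothetical helper for illustration; not a metric taken from the paper.
    """
    return throughput_gain / cost_ratio


# Finding 01: 1.4x throughput at 20% lower cost, i.e. a cost ratio of 0.8.
finding_01 = throughput_per_cost(1.4, 0.8)

# Finding 02: 2.35x throughput at the same cost, i.e. a cost ratio of 1.0.
finding_02 = throughput_per_cost(2.35, 1.0)

print(f"finding 01: {finding_01:.2f}x throughput per unit cost")
print(f"finding 02: {finding_02:.2f}x throughput per unit cost")
```

Under this normalization both configurations improve on the baseline. Finding 01 also lowers absolute spend, while finding 02 holds cost and power fixed.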

Abstract

Recent innovations in generative large language models (LLMs) have made their applications and use-cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation, and a memory-intensive token generation, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike compute-intensive prompt computation phases, token generation phases do not require the compute capability of the latest GPUs, and can be run with lower power and cost.…
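As a rough illustration of the two-phase split described above, the following minimal sketch routes each request through a prompt pool and then a token pool. Every name here (`Request`, `MachinePool`, `run_split`) is hypothetical, and the hand-off between pools is modeled as a simple queue move. The sketch does not model how intermediate state is transferred between machines, nor batching or scheduling, so it should not be read as the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: int      # tokens processed in the prompt phase
    max_new_tokens: int     # tokens produced in the token generation phase
    generated: int = 0
    log: list = field(default_factory=list)


class MachinePool:
    """A named group of machines, modeled here as a FIFO queue (assumption)."""

    def __init__(self, name: str):
        self.name = name
        self.queue = deque()

    def submit(self, request: Request) -> None:
        self.queue.append(request)


def run_split(requests):
    """Run every request through the prompt pool, then the token pool."""
    prompt_pool = MachinePool("prompt-pool")
    token_pool = MachinePool("token-pool")
    for request in requests:
        prompt_pool.submit(request)

    # Phase 1: compute-intensive prompt computation, whole prompt in one pass.
    while prompt_pool.queue:
        request = prompt_pool.queue.popleft()
        request.log.append((prompt_pool.name, request.prompt_tokens))
        # Hand-off to the other pool; state transfer is not modeled here.
        token_pool.submit(request)

    # Phase 2: memory-intensive token generation, one token per step.
    finished = []
    while token_pool.queue:
        request = token_pool.queue.popleft()
        while request.generated < request.max_new_tokens:
            request.generated += 1
            request.log.append((token_pool.name, 1))
        finished.append(request)
    return finished


done = run_split([Request(0, prompt_tokens=512, max_new_tokens=3)])
for name, tokens in done[0].log:
    print(name, tokens)
```

Because the two pools are separate objects, each could be provisioned with different hardware. That separation is the property the abstract motivates when it notes that token generation does not need the compute capability of the latest GPUs.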
