Pie: Pooling CPU Memory for LLM Inference

Yi Xu; Ziming Mao; Xiangxi Mo; Shu Liu; Ion Stoica

arXiv:2411.09317·cs.LG·November 15, 2024

TL;DR

Pie is an LLM inference framework that extends GPU memory with CPU memory through performance-transparent swapping and adaptive expansion. It improves throughput by up to 1.9× over vLLM and reduces GPU memory usage by up to 1.67×.

Contribution

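Pie combines two techniques: performance-transparent swapping, which uses predictable memory access patterns and high CPU-GPU bandwidth to move data concurrently with foreground computation, and adaptive expansion, which adjusts the CPU memory allocation based on real-time information.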

Findings

01. Pie outperforms vLLM by up to 1.9× in throughput.

02. Pie reduces GPU memory usage by up to 1.67×.

03. Pie achieves lower latency and higher throughput than FlexGen.

Abstract

The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping often results in higher latency and lower throughput. This paper introduces Pie, an LLM inference framework that addresses these challenges with performance-transparent swapping and adaptive expansion. By leveraging predictable memory access patterns and the high bandwidth of modern hardware like the NVIDIA GH200 Grace Hopper Superchip, Pie enables concurrent data swapping without affecting foreground computation, expanding effective memory without added latency. Adaptive expansion dynamically adjusts CPU memory allocation based on real-time information, optimizing memory usage and performance under varying conditions. Pie…
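The two ideas in the abstract can be illustrated with a minimal sketch. It is not Pie's implementation. The abstract does not specify the policy, so every name, threshold, and number below is a hypothetical choice made for illustration. `swap_is_hidden` checks the basic overlap condition (a transfer adds no latency if it finishes within the compute time it runs alongside), and `ExpansionController` is an assumed feedback rule that grows the CPU-resident allocation while observed latency stays near a baseline and shrinks it otherwise.

```python
from dataclasses import dataclass


def swap_is_hidden(bytes_to_swap: float,
                   bandwidth_bytes_per_s: float,
                   compute_time_s: float) -> bool:
    """Return True if the transfer fits inside the concurrent compute window.

    Illustrative condition only: transfer_time <= compute_time.
    """
    return bytes_to_swap / bandwidth_bytes_per_s <= compute_time_s


@dataclass
class ExpansionController:
    """Hypothetical feedback controller; NOT the policy described in the paper."""
    cpu_blocks: int = 0          # blocks currently placed in CPU memory
    max_cpu_blocks: int = 1024   # assumed upper bound on CPU-resident blocks
    step: int = 16               # assumed adjustment granularity
    tolerance: float = 0.02      # assumed acceptable slowdown vs. baseline (2%)

    def update(self, baseline_latency_ms: float, observed_latency_ms: float) -> int:
        """Grow the CPU allocation while swapping stays hidden; shrink otherwise."""
        if observed_latency_ms <= baseline_latency_ms * (1 + self.tolerance):
            self.cpu_blocks = min(self.cpu_blocks + self.step, self.max_cpu_blocks)
        else:
            self.cpu_blocks = max(self.cpu_blocks - self.step, 0)
        return self.cpu_blocks


if __name__ == "__main__":
    # Illustrative numbers, not measurements from the paper.
    print(swap_is_hidden(1e9, 400e9, 0.005))
    ctrl = ExpansionController()
    for observed in (10.1, 10.1, 12.0):
        print(ctrl.update(baseline_latency_ms=10.0, observed_latency_ms=observed))
```

In this sketch the controller backs off as soon as latency exceeds the tolerance, which mirrors the stated goal of expanding effective memory without added latency. A real system would need a measured baseline and some smoothing of the latency signal.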

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies