Memory Offloading for Large Language Model Inference with Latency SLO Guarantees

Chenxiang Ma; Zhisheng Ye; Hanyu Zhao; Zehua Yang; Tianhao Fu; Jiaxun Han; Jie Zhang; Yingwei Luo; Xiaolin Wang; Zhenlin Wang; Yong Li; Diyu Zhou

arXiv:2502.08182·cs.DC·February 13, 2025

Open Access

TL;DR

This paper introduces Select-N, a latency-SLO-aware memory offloading system for LLM serving that maximizes host memory usage while meeting latency SLOs, improving inference throughput by 1.85 times.

Contribution

Select-N exploits the deterministic computation time of each decoder layer in LLMs to balance latency guarantees against host memory utilization through an adaptive offloading interval.

Findings

01

Select-N consistently meets latency SLOs.

02

It improves inference throughput by 1.85 times.

03

It maximizes host memory use without violating latency constraints.

Abstract

Offloading large language model (LLM) state to host memory during inference promises to reduce operational costs by supporting larger models, longer inputs, and larger batch sizes. However, existing memory offloading mechanisms are not designed with latency service-level objectives (SLOs) in mind. As a result, they either cause frequent SLO violations or underutilize host memory, incurring economic loss and defeating the purpose of memory offloading. This paper presents Select-N, a latency-SLO-aware memory offloading system for LLM serving. A key challenge in designing Select-N is reconciling the tension between meeting SLOs and maximizing host memory usage. Select-N overcomes it by exploiting a unique characteristic of modern LLMs: during serving, the computation time of each decoder layer is deterministic. Leveraging this, Select-N introduces…
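The abstract is cut off before it describes the mechanism, so the following is only an illustrative sketch of why deterministic per-layer compute time makes an offloading interval predictable. It is not Select-N's algorithm. The cost model is an assumption: one decoder layer in every `interval` layers lives in host memory, fetching it takes a fixed `t_transfer`, and the fetch overlaps with the compute of the other `interval - 1` layers in its group. The function names and parameters are hypothetical.

```python
from typing import Optional


def iteration_latency(num_layers: int, interval: int,
                      t_compute: float, t_transfer: float) -> float:
    """Estimate one forward pass under the assumed cost model.

    Assumption: one layer in every `interval` layers is offloaded, and its
    fetch overlaps with the compute of the (interval - 1) resident layers
    that precede it. Any transfer time left over stalls the GPU.
    """
    offloaded = num_layers // interval
    stall = max(0.0, t_transfer - (interval - 1) * t_compute)
    return num_layers * t_compute + offloaded * stall


def smallest_interval(num_layers: int, t_compute: float,
                      t_transfer: float, slo: float) -> Optional[int]:
    """Return the smallest interval (most offloading) that meets the SLO.

    Under this model latency never increases as the interval grows, so the
    first interval that fits is the one that uses the most host memory.
    Returns None if no interval from 1 to num_layers meets the SLO.
    """
    for interval in range(1, num_layers + 1):
        if iteration_latency(num_layers, interval, t_compute, t_transfer) <= slo:
            return interval
    return None


# 32 layers, 1 ms compute per layer, 3 ms transfer per offloaded layer.
print(smallest_interval(32, 1.0, 3.0, slo=40.0))  # → 4
```

Because the per-layer time is treated as a known constant, the latency for every candidate interval can be computed ahead of time rather than measured at runtime; a tighter SLO simply pushes the chosen interval up and keeps more layers on the GPU.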

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis