AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains

Abhishek Vijaya Kumar; Gianni Antichi; Rachee Singh

arXiv:2407.21255·cs.DC·February 24, 2025

TL;DR

Aqua is a GPU memory management framework that enables responsive and high-throughput inference for large language models by reducing paging overhead, especially under bursty request loads.

Contribution

Aqua introduces a novel preemptive scheduling and memory management approach that significantly reduces paging overhead for LLM inference on GPUs.

Findings

01. Aqua improves inference responsiveness by 20X.

02. Aqua increases throughput over a single long prompt by 4X.

03. Aqua effectively manages GPU memory for multiple LLM modalities.

Abstract

Inference on large language models (LLMs) is constrained by GPU memory capacity. A sudden increase in the number of inference requests to a cloud-hosted LLM can deplete GPU memory, leading to contention between multiple prompts for limited resources. Modern LLM serving engines deal with the challenge of limited GPU memory using admission control, which causes them to be unresponsive during request bursts. We propose that preemptive scheduling of prompts in time slices is essential for ensuring responsive LLM inference, especially under conditions of high load and limited GPU memory. However, preempting prompt inference incurs a high paging overhead, which reduces inference throughput. We present Aqua, a GPU memory management framework that significantly reduces the overhead of paging inference state, achieving both responsive and high-throughput inference even under bursty request…
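
The trade-off the abstract describes can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Aqua's scheduler: the function names, the abstract time units, and the fixed `swap_cost` charged on every preemption are all hypothetical. It compares run-to-completion service (a stand-in for admission control, where later prompts wait for earlier ones to finish) against round-robin time slicing, and shows how a per-preemption paging cost stretches total completion time.

```python
from collections import deque


def fcfs_first_response(job_lengths):
    """Run-to-completion: each prompt waits for all earlier prompts to finish.

    Returns the time at which each prompt first gets the GPU.
    """
    first = []
    clock = 0
    for length in job_lengths:
        first.append(clock)
        clock += length
    return first


def round_robin_first_response(job_lengths, time_slice, swap_cost=0):
    """Time-sliced preemption with a hypothetical fixed paging cost.

    Each prompt runs for at most `time_slice` units. If it is not finished,
    it is preempted, `swap_cost` units are charged for paging its state out,
    and it rejoins the back of the queue.
    Returns (first-service time per prompt, total completion time).
    """
    queue = deque(enumerate(job_lengths))
    first = [None] * len(job_lengths)
    clock = 0
    while queue:
        idx, remaining = queue.popleft()
        if first[idx] is None:
            first[idx] = clock
        run = min(time_slice, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            clock += swap_cost  # assumed paging overhead per preemption
            queue.append((idx, remaining))
    return first, clock


if __name__ == "__main__":
    jobs = [100, 5, 5]  # one long prompt followed by two short ones
    print(fcfs_first_response(jobs))
    print(round_robin_first_response(jobs, time_slice=10, swap_cost=0))
    print(round_robin_first_response(jobs, time_slice=10, swap_cost=2))
```

In this toy setting the short prompts are served far earlier under time slicing, but each preemption of the long prompt adds paging time to the total. Reducing that per-preemption cost is the problem the abstract says Aqua targets.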

Taxonomy

Topics: Neural Networks and Applications