PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Ahmet Caner Yüzügüler; Jiawei Zhuang; Lukas Cavigelli

arXiv:2501.08192·cs.AI·May 27, 2025

Open Access

TL;DR

PRESERVE is a framework that prefetches model weights and KV-cache into on-chip memory during communication, significantly reducing inference latency and cost in distributed large language model serving.

Contribution

It introduces a prefetching approach that overlaps communication with loading model weights and KV-cache from off-chip HBM into on-chip cache, improving the scalability and performance of distributed LLM inference systems.

Findings

01

Up to 1.6x end-to-end speedup on commercial AI accelerators.

02

The optimal hardware configuration yields a 1.25x improvement in performance per cost.

03

Mitigates memory bottlenecks and communication overheads in LLM serving.

Abstract

Large language models (LLMs) are typically served from clusters of GPUs/NPUs that consist of a large number of devices. Unfortunately, communication between these devices incurs significant overhead, increasing inference latency and cost while limiting scalability. Prior work addressed this issue by overlapping communication with compute, but it has severe limitations due to the data dependencies between these operations. In this paper, we propose PRESERVE, a novel framework that prefetches model weights and KV-cache from off-chip HBM to the on-chip cache of AI accelerators during communication operations, which offers various advantages and performance improvements compared to prior methods. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a…
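
The idea of prefetching during communication can be illustrated with a toy latency model. The sketch below is not the paper's implementation or its cost model. The function name, parameters, and numbers are all hypothetical, and it assumes a memory-bound step whose time is dominated by streaming weights from HBM, with reads from on-chip cache treated as free. It compares a baseline, where the weight load starts only after the collective finishes, with a prefetching variant, where the portion of the weights that fits in cache is loaded while the collective is in flight.

```python
def step_latency(weight_bytes, hbm_bandwidth, comm_time, cache_bytes=0.0):
    """Toy latency (seconds) of one communication + memory-bound compute step.

    Hypothetical model, not taken from the paper:
      weight_bytes   bytes of weights/KV-cache the next operation must read
      hbm_bandwidth  HBM read bandwidth in bytes per second
      comm_time      duration of the collective (e.g. an all-reduce) in seconds
      cache_bytes    on-chip cache capacity available for prefetching
                     (0 means no prefetching, i.e. the baseline)
    """
    # Only the part that fits in the on-chip cache can be prefetched.
    prefetched = min(weight_bytes, cache_bytes)
    prefetch_time = prefetched / hbm_bandwidth

    # Whatever was not prefetched is streamed from HBM after communication.
    remaining_time = (weight_bytes - prefetched) / hbm_bandwidth

    # Prefetch and communication run concurrently, so the first phase lasts
    # as long as the slower of the two; the remaining load follows serially.
    return max(comm_time, prefetch_time) + remaining_time


if __name__ == "__main__":
    # Illustrative values only (assumptions, not measurements).
    weights = 100e6      # 100 MB read per step
    bandwidth = 1e12     # 1 TB/s HBM bandwidth
    comm = 50e-6         # 50 microseconds per collective

    baseline = step_latency(weights, bandwidth, comm)
    with_prefetch = step_latency(weights, bandwidth, comm, cache_bytes=64e6)
    print(f"baseline:      {baseline * 1e6:.1f} us")
    print(f"with prefetch: {with_prefetch * 1e6:.1f} us")
    print(f"speedup:       {baseline / with_prefetch:.2f}x")
```

In this simplified model the benefit is bounded by the cache capacity and by the communication time available to hide the load, which is in line with the abstract treating hardware configuration as a separate question.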

Taxonomy

Topics: Service-Oriented Architecture and Web Services · Network Packet Processing and Optimization · Algorithms and Data Compression