PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli

TL;DR
PRESERVE is a framework that prefetches model weights and KV-cache into on-chip memory during communication, significantly reducing inference latency and cost in distributed large language model serving.
Contribution
It introduces a novel prefetching approach that overlaps off-chip memory reads of model weights and KV-cache with communication operations, improving the scalability and performance of distributed LLM inference systems.
Findings
Up to 1.6x end-to-end speedup on commercial AI accelerators.
Optimal hardware configuration yields 1.25x performance per cost improvement.
Mitigates memory bottlenecks and communication overheads in LLM serving.
Abstract
Large language models (LLMs) are typically served from clusters of GPUs/NPUs that consist of a large number of devices. Unfortunately, communication between these devices incurs significant overhead, increasing the inference latency and cost while limiting the scalability. Prior work addressed this issue by overlapping communication with compute, but has severe limitations due to the data dependencies between these operations. In this paper, we propose PRESERVE, a novel framework that prefetches model weights and KV-cache from off-chip HBM memory to the on-chip cache of AI accelerators during the communication operations, which offers various advantages and performance improvements compared to prior methods. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a…
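As a rough illustration of why prefetching during communication can reduce latency, the toy model below compares a layer where communication, HBM reads, and compute run back to back with one where the HBM reads are issued while the collective is in flight. This is a minimal sketch, not the paper's method or cost model: the names `LayerCost`, `baseline_latency`, and `prefetch_latency` and all numbers are hypothetical, and it assumes the prefetched data fits in on-chip cache and that communication and HBM traffic do not contend for bandwidth.

```python
from dataclasses import dataclass


@dataclass
class LayerCost:
    """Hypothetical per-layer costs in microseconds (not measured values)."""
    comm_us: float     # time spent in the collective, e.g. an all-reduce
    load_us: float     # time to read the next weights/KV-cache from HBM
    compute_us: float  # compute time once the operands are on chip


def baseline_latency(layer: LayerCost) -> float:
    # No overlap: communication, HBM reads, and compute are serialized.
    return layer.comm_us + layer.load_us + layer.compute_us


def prefetch_latency(layer: LayerCost) -> float:
    # HBM reads start during communication, so only the part of the
    # load that does not fit under the communication time stays exposed.
    exposed_load_us = max(0.0, layer.load_us - layer.comm_us)
    return layer.comm_us + exposed_load_us + layer.compute_us


# Hypothetical example values, chosen only to make the arithmetic easy.
layer = LayerCost(comm_us=10.0, load_us=20.0, compute_us=5.0)
speedup = baseline_latency(layer) / prefetch_latency(layer)
```

With these made-up costs the serialized layer takes 35 µs and the prefetched one 25 µs. The model also shows the limit of the idea: once the load time is shorter than the communication time, the reads are fully hidden and communication itself becomes the remaining cost.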
Taxonomy
Topics: Service-Oriented Architecture and Web Services · Network Packet Processing and Optimization · Algorithms and Data Compression
