DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving

Foteini Strati; Sara McAllister; Amar Phanishayee; Jakub Tarnawski; Ana Klimovic

arXiv:2403.01876·cs.DC·March 5, 2024·1 cites

Open Access

TL;DR

DéjàVu is a KV-cache streaming system for distributed LLM serving that reduces pipeline latency, lowers GPU memory overprovisioning, and improves fault tolerance through prompt-token disaggregation, microbatch swapping, and state replication.

Contribution

The paper presents DéjàVuLib, a KV-cache streaming library, and uses it to address pipeline bubbles, GPU memory overprovisioning, and long recovery times in distributed LLM serving.

Findings

01

Significant reduction in pipeline latency.

02

Improved GPU memory utilization.

03

Improved fault tolerance and faster recovery after failures.

Abstract

Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system to address all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.
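To make the memory overprovisioning point concrete, here is a rough back-of-the-envelope sketch of how much KV cache a pipeline stage holds with and without swapping. It is not taken from the paper: the function names, the model dimensions, and the assumption that swapping keeps only two microbatches resident on the GPU (one in use, one being prefetched) are all illustrative. Only the KV-cache size formula (keys plus values, per layer, per head, per token) is standard.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """KV cache size for one microbatch on one pipeline stage.

    The leading factor 2 counts keys and values; bytes_per_elem=2 assumes fp16.
    """
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem


def resident_kv_bytes(per_microbatch_bytes, microbatches_in_flight,
                      swapping, resident_slots=2):
    """GPU-resident KV cache on one stage (hypothetical model, not the paper's).

    Without swapping, every in-flight microbatch keeps its cache on the GPU.
    With swapping, we assume only `resident_slots` caches stay on the GPU and
    the rest are held in host memory.
    """
    if not swapping:
        return per_microbatch_bytes * microbatches_in_flight
    return per_microbatch_bytes * min(resident_slots, microbatches_in_flight)


# Illustrative numbers (assumed, not from the paper): a stage holding 12 layers
# with 32 heads of dimension 128, sequences of 2048 tokens, microbatch size 8.
per_mb = kv_cache_bytes(num_layers=12, num_heads=32, head_dim=128,
                        seq_len=2048, batch_size=8)
GIB = 2 ** 30
print(per_mb / GIB)                                         # 3.0
print(resident_kv_bytes(per_mb, 4, swapping=False) / GIB)   # 12.0
print(resident_kv_bytes(per_mb, 4, swapping=True) / GIB)    # 6.0
```

Under these assumed numbers, keeping all four in-flight microbatches resident costs 12 GiB per stage, while a two-slot swapping scheme would need 6 GiB. The actual savings depend on the swapping policy and transfer overlap, which this sketch does not model.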

Taxonomy

Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Distributed systems and fault tolerance