DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic

TL;DR
DéjàVu is a KV-cache streaming system for distributed LLM serving. It uses prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping to manage GPU memory efficiently, and state replication to provide fault tolerance.
Contribution
The paper presents DéjàVuLib, a versatile KV-cache streaming library, and uses it to address pipeline bubbles, GPU memory overprovisioning, and long recovery times in distributed LLM serving.
Findings
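Prompt-token disaggregation reduces the pipeline bubbles caused by the bimodal latency of prompt and token processing.
Microbatch swapping improves GPU memory utilization.
State replication improves fault tolerance and shortens recovery after failures.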
Abstract
Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system to address all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.
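To make the idea of microbatch swapping concrete, the following is a minimal toy sketch and not DéjàVuLib's actual API. It assumes that a pipeline stage works on one microbatch at a time, so only that microbatch's KV cache needs to be resident in GPU memory while the others wait in host memory. The names `SwapManager`, `activate`, and `run` are hypothetical, and plain Python dictionaries stand in for GPU and CPU buffers.

```python
from dataclasses import dataclass, field


@dataclass
class SwapManager:
    """Toy model (hypothetical): keep only the active microbatch's KV cache 'on GPU'."""
    gpu: dict = field(default_factory=dict)   # microbatch id -> KV cache resident on the GPU
    host: dict = field(default_factory=dict)  # microbatch id -> KV cache parked in CPU memory

    def activate(self, mb_id):
        # Swap out every resident microbatch other than the requested one.
        for other in list(self.gpu):
            if other != mb_id:
                self.host[other] = self.gpu.pop(other)
        # Swap in the requested microbatch, or start an empty cache for it.
        if mb_id not in self.gpu:
            self.gpu[mb_id] = self.host.pop(mb_id, [])
        return self.gpu[mb_id]


def run(num_microbatches, steps):
    """Round-robin over microbatches, tracking the peak number of caches on the GPU."""
    mgr = SwapManager()
    peak_resident = 0
    for step in range(steps):
        for mb in range(num_microbatches):
            cache = mgr.activate(mb)
            cache.append((mb, step))  # stand-in for one generated token's keys/values
            peak_resident = max(peak_resident, len(mgr.gpu))
    return mgr, peak_resident


if __name__ == "__main__":
    mgr, peak = run(num_microbatches=4, steps=3)
    print("peak caches resident on GPU:", peak)
    print("caches parked in host memory:", sorted(mgr.host))
```

In this sketch the GPU holds a single KV cache at any time, regardless of how many microbatches are in flight. A real system would also have to overlap the transfers with computation so that swapping does not stall the pipeline, which this toy model does not attempt.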
Taxonomy
Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Distributed systems and fault tolerance
