Towards Resiliency in Large Language Model Serving with KevlarFlow
Shangshu Qian, Kipling Liu, P. C. Sruthi, Lin Tan, Yongle Zhang

TL;DR
KevlarFlow is a fault-tolerant LLM serving architecture that shortens recovery time and lowers latency during hardware failures by combining decoupled model parallelism initialization, dynamic traffic rerouting, and background KV cache replication.
Contribution
The paper introduces KevlarFlow, a serving architecture that improves LLM serving resilience through three fault-tolerance mechanisms while adding minimal runtime overhead.
Findings
Reduces mean-time-to-recovery (MTTR) by 20x.
Improves average latency by 3.1x during failures.
Achieves 378.9x faster time-to-first-token (TTFT).
Abstract
Large Language Model (LLM) serving systems remain fundamentally fragile, where frequent hardware faults in hyperscale clusters trigger disproportionate service outages in the software stack. Current recovery mechanisms are prohibitively slow, often requiring up to 10 minutes to reinitialize resources and reload massive model weights. We introduce KevlarFlow, a fault-tolerant serving architecture designed to bridge the gap between hardware unreliability and service availability. KevlarFlow leverages 1) decoupled model parallelism initialization, 2) dynamic traffic rerouting, and 3) background KV cache replication to maintain high throughput during partial failures. Our evaluation demonstrates that KevlarFlow reduces mean-time-to-recovery (MTTR) by 20x and, under failure conditions, improves average latency by 3.1x, 99th percentile (p99) latency by 2.8x, average time-to-first-token (TTFT)…
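Illustration
The abstract names traffic rerouting and KV cache replication without describing how they interact. The sketch below is a toy illustration of that general idea only, not KevlarFlow's implementation: `Replica`, `Router`, and every method name are hypothetical, and the replication here is a synchronous copy standing in for the background replication the paper describes.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical serving replica holding per-request KV cache entries."""
    name: str
    healthy: bool = True
    kv_cache: dict = field(default_factory=dict)  # request id -> cached tokens


class Router:
    """Hypothetical router that keeps one backup KV cache copy per request."""

    def __init__(self, replicas):
        self.replicas = {r.name: r for r in replicas}
        self.primary = {}  # request id -> name of replica serving it
        self.backup = {}   # request id -> name of replica holding the copy

    def admit(self, request_id, tokens):
        healthy = [r for r in self.replicas.values() if r.healthy]
        if len(healthy) < 2:
            raise RuntimeError("need two healthy replicas for primary and backup")
        primary, backup = healthy[0], healthy[1]
        primary.kv_cache[request_id] = list(tokens)
        # Assumption: a synchronous copy stands in for background replication.
        backup.kv_cache[request_id] = list(tokens)
        self.primary[request_id] = primary.name
        self.backup[request_id] = backup.name
        return primary.name

    def mark_failed(self, name):
        """Mark a replica as failed and reroute its requests to their backups."""
        failed = self.replicas[name]
        failed.healthy = False
        failed.kv_cache.clear()  # cache on the failed replica is lost
        rerouted = []
        for request_id, owner in self.primary.items():
            if owner != name:
                continue
            backup = self.replicas[self.backup[request_id]]
            if backup.healthy:
                # The backup already holds the cache, so no recomputation is needed.
                self.primary[request_id] = backup.name
                rerouted.append(request_id)
        return rerouted

    def route(self, request_id):
        return self.primary[request_id]


def demo():
    router = Router([Replica("a"), Replica("b"), Replica("c")])
    router.admit("req-1", [1, 2, 3])
    router.mark_failed("a")
    owner = router.route("req-1")
    return owner, router.replicas[owner].kv_cache["req-1"]
```

In this toy model, a request whose primary fails continues on the replica that already holds its cache copy, which is the intuition behind avoiding a full reload on failure. A real system would also need to re-establish a new backup after failover, which the sketch omits.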
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Distributed systems and fault tolerance
