Towards Resiliency in Large Language Model Serving with KevlarFlow

Shangshu Qian; Kipling Liu; P. C. Sruthi; Lin Tan; Yongle Zhang

arXiv:2601.22438·cs.DC·February 2, 2026

Open Access

TL;DR

KevlarFlow is a fault-tolerant LLM serving architecture that shortens recovery time and lowers latency during hardware failures by combining decoupled model parallelism initialization, dynamic traffic rerouting, and background KV cache replication.

Contribution

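The paper introduces KevlarFlow, a serving architecture that makes LLM serving more resilient to hardware faults through three mechanisms (decoupled model parallelism initialization, dynamic traffic rerouting, and background KV cache replication) with minimal runtime overhead.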

Findings

01. Reduces mean-time-to-recovery (MTTR) by 20x.

02. Improves average latency by 3.1x and 99th percentile (p99) latency by 2.8x under failure conditions.

03. Improves average time-to-first-token (TTFT) by 378.9x under failure conditions.

Abstract

Large Language Model (LLM) serving systems remain fundamentally fragile: frequent hardware faults in hyperscale clusters trigger disproportionate service outages in the software stack. Current recovery mechanisms are prohibitively slow, often requiring up to 10 minutes to reinitialize resources and reload massive model weights. We introduce KevlarFlow, a fault-tolerant serving architecture designed to bridge the gap between hardware unreliability and service availability. KevlarFlow leverages 1) decoupled model parallelism initialization, 2) dynamic traffic rerouting, and 3) background KV cache replication to maintain high throughput during partial failures. Our evaluation demonstrates that KevlarFlow reduces mean-time-to-recovery (MTTR) by 20x and, under failure conditions, improves average latency by 3.1x, 99th percentile (p99) latency by 2.8x, average time-to-first-token (TTFT)…
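To make two of the mechanisms named in the abstract more concrete, here is a minimal toy sketch of traffic rerouting combined with KV cache replication. It is not the paper's implementation: the names `Replica`, `Router`, `admit`, and `fail` are hypothetical, the KV cache is modeled as a plain token list, and replication is done synchronously for simplicity, whereas the abstract describes it as a background activity. Decoupled model parallelism initialization is not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one model-serving instance."""
    name: str
    healthy: bool = True
    # request_id -> tokens whose KV state this replica holds (toy model)
    kv_cache: dict[str, list[str]] = field(default_factory=dict)


class Router:
    """Toy router: places each request on a healthy replica and keeps a
    copy of its KV state on a second replica (assumed design, not the paper's)."""

    def __init__(self, replicas: list[Replica]) -> None:
        self.replicas = replicas
        self.primary: dict[str, Replica] = {}  # replica serving each request
        self.backup: dict[str, Replica] = {}   # replica holding the KV copy

    def _healthy(self, exclude: Replica | None = None) -> list[Replica]:
        return [r for r in self.replicas if r.healthy and r is not exclude]

    def _replicate(self, request_id: str) -> None:
        # Synchronous here; a real system would copy in the background.
        primary = self.primary[request_id]
        peers = self._healthy(exclude=primary)
        if peers:
            target = min(peers, key=lambda r: len(r.kv_cache))
            target.kv_cache[request_id] = list(primary.kv_cache[request_id])
            self.backup[request_id] = target

    def admit(self, request_id: str, prompt_tokens: list[str]) -> Replica:
        candidates = self._healthy()
        if not candidates:
            raise RuntimeError("no healthy replica available")
        primary = min(candidates, key=lambda r: len(r.kv_cache))
        primary.kv_cache[request_id] = list(prompt_tokens)
        self.primary[request_id] = primary
        self._replicate(request_id)
        return primary

    def fail(self, replica: Replica) -> None:
        """Mark a replica as failed and reroute the requests it was serving."""
        replica.healthy = False
        replica.kv_cache.clear()  # state on the failed instance is lost
        for request_id in list(self.primary):
            if self.primary[request_id] is replica:
                promoted = self.backup.pop(request_id, None)
                if promoted is None:
                    # No surviving copy: the request must be resubmitted.
                    del self.primary[request_id]
                    continue
                self.primary[request_id] = promoted  # reroute to the copy
            elif self.backup.get(request_id) is replica:
                del self.backup[request_id]  # lost the copy, not the primary
            else:
                continue
            self._replicate(request_id)  # restore redundancy if possible


a, b, c = Replica("a"), Replica("b"), Replica("c")
router = Router([a, b, c])
router.admit("r1", ["Hello", "world"])
router.fail(a)
print(router.primary["r1"].name)
```

In this sketch, a request survives the loss of its serving replica because the promoted replica already holds the request's KV state, so the prompt does not need to be processed again from scratch. How KevlarFlow actually schedules replication and rerouting is described in the paper itself, not here.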

Taxonomy

Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Distributed systems and fault tolerance