FailSafe: High-performance Resilient Serving
Ziyi Xu, Zhiqiang Xie, Swapnil Gandhi, Christos Kozyrakis

TL;DR
FailSafe is a fault-tolerant tensor-parallel serving system for large language model inference. It sustains high throughput and balanced load under GPU failures by combining cyclic KVCache placement, hybrid tensor- and data-parallel attention, load-aware request routing, and fast recovery based on proactive KVCache backup and on-demand weight recovery.
Contribution
FailSafe introduces techniques for resilient tensor parallelism that keep computation and memory balanced across the remaining GPUs after a failure and avoid expensive KVCache recomputation during recovery.
Findings
Up to 2x higher throughput compared to standard approaches
Two orders of magnitude lower recovery latency
Maintains high throughput with up to three GPU failures
Abstract
Tensor parallelism (TP) enables large language models (LLMs) to scale inference efficiently across multiple GPUs, but its tight coupling makes systems fragile: a single GPU failure can halt execution, trigger costly KVCache recomputation, and introduce long-term compute and memory imbalance. We present FailSafe, a fault-tolerant TP serving system that sustains high performance under irregular GPU availability. FailSafe introduces three techniques to balance computation and memory across GPUs: (1) Cyclic KVCache Placement for uniform memory utilization, (2) Hybrid Attention combining tensor- and data-parallel attention to eliminate stragglers, and (3) Fine-Grained Load-Aware Routing to dynamically balance requests. It further employs proactive KVCache backup and on-demand weight recovery to avoid expensive recomputation and redundant data transfers. We implement these techniques in a…
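The memory imbalance the abstract mentions can be illustrated with a small sketch. This is not the paper's implementation: it assumes the KVCache is sharded by KV head, that heads are assigned to GPUs round-robin, and that "cyclic" placement means continuing the round-robin across layers instead of restarting it at each layer. The function names (`static_placement`, `cyclic_placement`, `load_per_gpu`) and the sizes used below are hypothetical.

```python
from collections import Counter


def static_placement(num_layers, num_kv_heads, num_gpus):
    """Assumed baseline: every layer restarts the round-robin at GPU 0."""
    return {
        (layer, head): head % num_gpus
        for layer in range(num_layers)
        for head in range(num_kv_heads)
    }


def cyclic_placement(num_layers, num_kv_heads, num_gpus):
    """Assumed cyclic variant: the round-robin continues across layers,
    so the GPU that receives the extra head rotates from layer to layer."""
    return {
        (layer, head): (layer * num_kv_heads + head) % num_gpus
        for layer in range(num_layers)
        for head in range(num_kv_heads)
    }


def load_per_gpu(placement, num_gpus):
    """Count how many (layer, head) KVCache shards each GPU holds."""
    counts = Counter(placement.values())
    return [counts.get(gpu, 0) for gpu in range(num_gpus)]


# Hypothetical scenario: 8 KV heads, one of 8 GPUs has failed, 7 remain.
static_load = load_per_gpu(static_placement(7, 8, 7), 7)
cyclic_load = load_per_gpu(cyclic_placement(7, 8, 7), 7)
print("static:", static_load)
print("cyclic:", cyclic_load)
```

Under these assumptions, the static layout leaves GPU 0 holding twice as many shards as any other GPU, while the cyclic layout spreads the same 56 shards evenly. How FailSafe actually places shards, and how it handles layer counts that do not divide evenly, is not specified in the text above.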
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Software System Performance and Reliability · Distributed systems and fault tolerance
