Making MoE-based LLM Inference Resilient with Tarragon

Songyu Zhang; Aaron Tam; Myungjin Lee; Shixiong Qi; K. K. Ramakrishnan

arXiv:2601.01310·cs.DC·January 7, 2026

Open Access

TL;DR

Tarragon is a resilient MoE inference framework that confines failures to individual workers by reconfiguring data paths and applying self-healing mechanisms, so LLM inference continues with minimal disruption.

Contribution

Tarragon introduces a failure confinement and recovery approach for MoE-based LLM inference that reduces failure-induced stalls by 160-213x while maintaining performance during failure-free operation.

Findings

01 Reduces failure-induced stalls by 160-213x

02 Maintains performance during failure-free operation

03 Enables continuous inference despite worker failures

Abstract

Mixture-of-Experts (MoE) models are increasingly used to serve LLMs at scale, but failures become common as deployment scale grows. Existing systems exhibit poor failure resilience: even a single worker failure triggers a coarse-grained, service-wide restart that discards accumulated progress and halts the entire inference pipeline during recovery, an approach clearly ill-suited to latency-sensitive LLM services. We present Tarragon, a resilient MoE inference framework that confines the impact of a failure to individual workers while allowing the rest of the pipeline to continue making forward progress. Tarragon exploits the natural separation between attention and expert computation in MoE-based transformers, treating attention workers (AWs) and expert workers (EWs) as distinct failure domains. Tarragon introduces a reconfigurable datapath to mask failures by rerouting requests to…
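The idea of confining a failure to one worker and rerouting around it can be illustrated with a small Python sketch. This is a generic illustration under stated assumptions, not Tarragon's implementation: the `Worker` and `Datapath` classes, the `healthy` flag, and the choice to fall back to the next candidate expert worker in an ordered list are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker; `kind` is "AW" (attention) or "EW" (expert)."""
    name: str
    kind: str
    healthy: bool = True


@dataclass
class Datapath:
    """Hypothetical routing table: expert id -> ordered candidate workers."""
    routes: dict[int, list[Worker]] = field(default_factory=dict)

    def pick(self, expert_id: int) -> Worker:
        # Return the first healthy candidate. A failed worker is skipped,
        # so routing for other experts and workers is left untouched.
        for worker in self.routes[expert_id]:
            if worker.healthy:
                return worker
        raise RuntimeError(f"no healthy worker for expert {expert_id}")


# Two expert workers, each listed as a fallback for the other (an assumption).
ew0 = Worker("ew0", "EW")
ew1 = Worker("ew1", "EW")
path = Datapath({0: [ew0, ew1], 1: [ew1, ew0]})

before = path.pick(0).name  # routed to ew0 while it is healthy
ew0.healthy = False         # simulate a single worker failure
after = path.pick(0).name   # expert 0 is now served by ew1
other = path.pick(1).name   # expert 1 routing is unaffected
```

In this sketch a failure changes only the routing decision for requests that would have reached the failed worker; nothing is restarted, and requests for other experts follow the same path as before.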

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques