AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism

Wendong Xu; Chujie Chen; He Xiao; Kuan Li; Jing Xiong; Chen Zhang; Wenyong Zhou; Chaofan Tao; Yang Bai; Bei Yu; Ngai Wong

arXiv:2511.11617 · cs.DC · November 18, 2025

Open Access

TL;DR

AnchorTP is a resilient tensor parallelism framework for large language model inference that enables fast recovery from GPU failures with minimal downtime and data movement.

Contribution

It introduces a state-preserving elastic tensor parallelism framework compatible with MoE, featuring a bandwidth-aware planner and scheduler for rapid failure recovery.

Findings

1. Reduces Time to First Success by up to 11x

2. Decreases Time to Peak by up to 59%

3. Supports unequal-width partitioning and is compatible with MoE models

Abstract

Large Language Model (LLM) inference services demand exceptionally high availability and low latency, yet multi-GPU Tensor Parallelism (TP) makes them vulnerable to single-GPU failures. We present AnchorTP, a state-preserving elastic TP framework for fast recovery. It (i) enables Elastic Tensor Parallelism (ETP) with unequal-width partitioning over any number of GPUs and compatibility with Mixture-of-Experts (MoE), and (ii) preserves model parameters and KV caches in GPU memory via a daemon decoupled from the inference process. To minimize downtime, we propose a bandwidth-aware planner based on a Continuous Minimal Migration (CMM) algorithm that minimizes reload bytes under a byte-cost dominance assumption, and an execution scheduler that pipelines P2P transfers with reloads. These components jointly restore service quickly with minimal data movement and without changing service…
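To make the data-movement idea in the abstract concrete, here is a minimal Python sketch of what happens when one GPU drops out of a tensor-parallel group and the remaining GPUs take over unequal-width, contiguous shards. It is not the paper's CMM algorithm, which the abstract does not describe in detail. The function names (`contiguous_partition`, `movement_after_failure`), the near-equal split rule, and the assumption that each survivor keeps its original rank order are all hypothetical choices made for illustration. The sketch only classifies columns as already resident on the survivor ("kept"), obtainable from another surviving GPU ("p2p"), or previously held only by the failed GPU and therefore in need of a reload ("reload").

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open column range [start, end)


def contiguous_partition(width: int, num_gpus: int) -> List[Interval]:
    """Split `width` columns into contiguous shards whose sizes differ by at most 1."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    base, extra = divmod(width, num_gpus)
    shards, start = [], 0
    for rank in range(num_gpus):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more column
        shards.append((start, start + size))
        start += size
    return shards


def overlap(a: Interval, b: Interval) -> int:
    """Number of columns shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def movement_after_failure(width: int, num_gpus: int, failed: int) -> Dict[str, int]:
    """Count columns kept in place, fetched peer-to-peer, or reloaded after one GPU fails.

    Assumption (illustrative only): survivors keep their relative order and the
    new layout is again a contiguous near-equal split over the survivors.
    """
    old = contiguous_partition(width, num_gpus)
    survivors = [r for r in range(num_gpus) if r != failed]
    new = contiguous_partition(width, len(survivors))

    totals = {"kept": 0, "p2p": 0, "reload": 0}
    for new_rank, gpu in enumerate(survivors):
        target = new[new_rank]
        kept = overlap(target, old[gpu])      # already in this GPU's memory
        lost = overlap(target, old[failed])   # only the failed GPU had these
        totals["kept"] += kept
        totals["reload"] += lost
        totals["p2p"] += (target[1] - target[0]) - kept - lost
    return totals


if __name__ == "__main__":
    print(contiguous_partition(10, 4))
    print(movement_after_failure(width=10, num_gpus=4, failed=3))
```

In this toy setting, 10 columns over 4 GPUs shrink to 3 GPUs, which cannot be split evenly, so the shards necessarily have unequal widths. A real planner would work in bytes rather than columns and would weigh link bandwidth when choosing a layout, which this sketch does not attempt.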

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Software System Performance and Reliability