FuxiShuffle: An Adaptive and Resilient Shuffle Service for Distributed Data Processing on Alibaba Cloud

Yuhao Lin; Zhipeng Tang; Jiayan Tong; Junqing Xiao; Bin Lu; Yuhang Li; Chao Li; Zhiguo Zhang; Junhua Wang; Hao Luo; James Cheng; Chuang Hu; Jiawei Jiang; Xiao Yan

arXiv:2602.22580·cs.DC·February 27, 2026

Open Access

TL;DR

FuxiShuffle is a novel adaptive and resilient shuffle service designed for large-scale distributed data processing, dynamically optimizing performance and fault tolerance in Alibaba Cloud's MaxCompute platform.

Contribution

FuxiShuffle introduces dynamic shuffle mode selection, progress-aware scheduling, and active failure resilience mechanisms tailored to ultra-large, highly dynamic environments.

Findings

01 Reduces end-to-end job completion time

02 Lowers aggregate resource consumption

03 Improves adaptability and failure resilience

Abstract

Shuffle exchanges intermediate results between upstream and downstream operators in distributed data processing and is usually the bottleneck due to factors such as small random I/Os and network contention. Several systems have been designed to improve shuffle efficiency, but from our experience running ultra-large clusters on the Alibaba Cloud MaxCompute platform, we observe that they cannot adapt to highly dynamic job characteristics and cluster resource conditions, and that their fault tolerance mechanisms are passive and inefficient when failures are inevitable. To address these limitations, we design and implement FuxiShuffle, a general data shuffle service for the ultra-large production environment of MaxCompute that features good adaptability and efficient failure resilience. Specifically, to achieve good adaptability, FuxiShuffle dynamically selects the shuffle mode based on runtime…
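To make the "small random I/Os" point concrete, here is a back-of-the-envelope sketch that is not taken from the paper. It assumes a conventional all-to-all shuffle in which every upstream task writes one block for every downstream task, so the block count grows as the product of the two task counts while the average block shrinks. The function name `shuffle_block_stats` and the example figures are illustrative assumptions, not measurements from MaxCompute.

```python
from typing import Tuple


def shuffle_block_stats(num_upstream: int, num_downstream: int,
                        total_bytes: int) -> Tuple[int, float]:
    """Return (block count, average block size in bytes).

    Assumption: a plain all-to-all shuffle where each upstream task
    produces exactly one block per downstream task.
    """
    if num_upstream <= 0 or num_downstream <= 0:
        raise ValueError("task counts must be positive")
    blocks = num_upstream * num_downstream
    return blocks, total_bytes / blocks


if __name__ == "__main__":
    # Hypothetical job: 10,000 x 10,000 tasks shuffling 10 TiB in total.
    blocks, avg = shuffle_block_stats(10_000, 10_000, 10 * 1024**4)
    print(f"{blocks:,} blocks, about {avg / 1024:.1f} KiB each")
```

Under these assumed numbers the job reads on the order of a hundred million blocks of roughly a hundred KiB each, which is the kind of access pattern that makes disk and network efficiency hard to sustain.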

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Distributed systems and fault tolerance