StreamShield: A Production-Proven Resiliency Solution for Apache Flink at ByteDance
Yong Fang, Yuxing Han, Meng Wang, Yifan Zhang, Yue Ma, Chi Zhang

TL;DR
StreamShield is a comprehensive resiliency solution for Apache Flink at ByteDance, combining runtime optimization, fault tolerance, hybrid replication, and high availability to ensure stability and rapid recovery in large-scale production environments.
Contribution
The paper introduces a production-proven, integrated resiliency framework for Apache Flink that addresses operational challenges and improves fault tolerance in large-scale deployments.
Findings
Improved fault recovery times in production clusters
Enhanced system stability under failure conditions
Effective deployment pipeline ensuring reliability
Abstract
Distributed Stream Processing Systems (DSPSs) form the backbone of real-time processing and analytics at ByteDance, where Apache Flink powers one of the largest production clusters worldwide. Ensuring resiliency, the ability to withstand and rapidly recover from failures, together with operational stability, which provides consistent and predictable performance under normal conditions, is essential for meeting strict Service Level Objectives (SLOs). However, achieving resiliency and stability in large-scale production environments remains challenging due to the cluster scale, business diversity, and significant operational overhead. In this work, we present StreamShield, a production-proven resiliency solution deployed in ByteDance's Flink clusters. Designed along complementary perspectives of the engine and cluster, StreamShield introduces key techniques to enhance resiliency, covering…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Distributed Systems and Fault Tolerance
