Failure Transparency in Stateful Dataflow Systems (Technical Report)
Aleksey Veresov (1), Jonas Spenger (1), Paris Carbone (1, 2),, Philipp Haller (1) ((1) KTH Royal Institute of Technology, (2) RISE Research, Institutes of Sweden)

TL;DR
This paper formally proves failure transparency in Apache Flink by modeling its failure handling mechanism, defining a novel observational explainability concept, and demonstrating that the failure-free model accurately abstracts failure-related behaviors.
Contribution
It introduces a formal proof of failure transparency for Apache Flink, modeling its failure handling and establishing correctness through observational semantics.
Findings
Formal proof of failure transparency for Apache Flink
Modeling of failure handling using small-step operational semantics
Establishment of liveness under fair execution assumptions
Abstract
Failure transparency enables users to reason about distributed systems at a higher level of abstraction, where complex failure-handling logic is hidden. This is especially true for stateful dataflow systems, which are the backbone of many cloud applications. In particular, this paper focuses on proving failure transparency in Apache Flink, a popular stateful dataflow system. Even though failure transparency is a critical aspect of Apache Flink, to date it has not been formally proven. Showing that the failure transparency mechanism is correct, however, is challenging due to the complexity of the mechanism itself. Nevertheless, this complexity can be effectively hidden behind a failure transparent programming interface. To show that Apache Flink is failure transparent, we model it in small-step operational semantics. Next, we provide a novel definition of failure transparency based on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
