Fries: Fast and Consistent Runtime Reconfiguration in Dataflow Systems with Transactional Guarantees (Extended Version)
Zuozhi Wang, Shengquan Ni, Avinash Kumar, Chen Li

TL;DR
Fries is a fast, consistent runtime reconfiguration technique for dataflow systems. It uses fast control messages to apply updates on the fly, without stopping the running job, and it supports fault tolerance.
Contribution
The paper presents Fries, a reconfiguration scheduler that uses fast control messages to keep reconfigurations consistent while avoiding the long delays of epoch-based methods.
Findings
Fries completes reconfigurations faster than epoch-based schedulers.
The technique maintains consistency during reconfiguration in various dataflow classes.
Experimental results demonstrate improved performance and fault tolerance in cluster environments.
Abstract
A computing job in a big data system can take a long time to run, especially for pipelined executions on data streams. Developers often need to change the computing logic of the job such as fixing a loophole in an operator or changing the machine learning model in an operator with a cheaper model to handle a sudden increase of the data-ingestion rate. Recently many systems have started supporting runtime reconfigurations to allow this type of change on the fly without killing and restarting the execution. While the delay in reconfiguration is critical to performance, existing systems use epochs to do runtime reconfigurations, which can cause a long delay. In this paper we develop a new technique called Fries that leverages the emerging availability of fast control messages in many systems, since these messages can be sent without being blocked by data messages. We formally define…
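The latency argument in the abstract can be illustrated with a toy, single-worker sketch. It is not the Fries scheduler and says nothing about its consistency guarantees; it only contrasts a reconfiguration marker that waits in the data queue (epoch style) with a control message delivered on a separate channel that the worker checks before each tuple. All names here (`run_epoch_style`, `run_fast_control`, `arrive_after`) are hypothetical and chosen for illustration.

```python
from collections import deque

RECONFIGURE = object()  # hypothetical marker type for the epoch-style queue


def run_epoch_style(pending, old_fn, new_fn):
    """Marker travels in the data queue, so it waits behind pending tuples."""
    queue = deque(pending)
    queue.append((RECONFIGURE, new_fn))  # enqueued after the backlog
    fn, out = old_fn, []
    while queue:
        item = queue.popleft()
        if isinstance(item, tuple) and item[0] is RECONFIGURE:
            fn = item[1]  # takes effect only once the backlog is drained
        else:
            out.append(fn(item))
    return out


def run_fast_control(pending, old_fn, new_fn, arrive_after=1):
    """Control message uses its own channel and is checked before each tuple."""
    data, control = deque(pending), deque()
    fn, out, processed = old_fn, [], 0
    while data:
        if processed == arrive_after:
            control.append(new_fn)  # simulated arrival of the control message
        while control:
            fn = control.popleft()  # not blocked by queued data tuples
        out.append(fn(data.popleft()))
        processed += 1
    return out


def old_logic(x):
    return ("old", x)


def new_logic(x):
    return ("new", x)


backlog = [1, 2, 3, 4]
epoch_result = run_epoch_style(backlog, old_logic, new_logic)
fast_result = run_fast_control(backlog, old_logic, new_logic, arrive_after=1)
```

In this sketch the epoch-style worker processes the whole backlog with the old logic, while the control-channel worker switches after the first tuple. Deciding when it is safe to apply such an update across multiple operators is the consistency problem the paper addresses, and it is outside the scope of this example.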
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
