Bleach: A Distributed Stream Data Cleaning System
Yongchao Tian, Pietro Michiardi, Marko Vukolic

TL;DR
Bleach is a distributed system designed for real-time, rule-based cleaning of streaming data, achieving high throughput, low latency, and adaptability to rule changes through efficient data structures and incremental algorithms.
Contribution
The paper introduces Bleach, a distributed stream data cleaning system that supports real-time violation detection, data repair, and dynamic rule updates, and that outperforms a micro-batch baseline in the authors' evaluation.
Findings
High throughput and low latency in data cleaning
Effective handling of rule dynamics and unbounded data streams
Superior performance compared to micro-batch baseline
Abstract
In this paper we address the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and ability to cope with the unbounded nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a "cumulative" sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using a TPC-DS derived dirty data stream and observe its high throughput, low latency and high cleaning accuracy, even with rule dynamics. Experimental results indicate superior performance of Bleach compared to a baseline system built on the…
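To make the idea of incremental, rule-based repair more concrete, here is a minimal single-machine sketch in Python. It is not Bleach's implementation: the class name `IncrementalFDCleaner`, the record layout, and the choice of a single functional-dependency rule (for example `zip -> city`) are assumptions made purely for illustration. It keeps one equivalence class per left-hand-side value, updates it as each record arrives, and repairs the right-hand-side attribute to the most frequent value seen so far. Distribution, windowing, and rule dynamics are left out.

```python
from collections import Counter, defaultdict


class IncrementalFDCleaner:
    """Hypothetical cleaner for one functional dependency lhs -> rhs.

    Illustrative only; this is not the data structure used by Bleach.
    """

    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs
        # One equivalence class per LHS value, stored as counts of the
        # RHS values observed so far for that LHS value.
        self.classes = defaultdict(Counter)

    def process(self, record):
        """Consume one record; return (repaired_record, violation_detected)."""
        counts = self.classes[record[self.lhs]]
        counts[record[self.rhs]] += 1
        # A violation exists when the class holds more than one RHS value.
        violated = len(counts) > 1
        # Repair to the most frequent value; on a tie, max() keeps the
        # value that was seen first (dict insertion order).
        target = max(counts, key=counts.get)
        repaired = dict(record)
        repaired[self.rhs] = target
        return repaired, violated


cleaner = IncrementalFDCleaner("zip", "city")
stream = [
    {"zip": "06410", "city": "Biot"},
    {"zip": "06410", "city": "Biot"},
    {"zip": "06410", "city": "Boit"},  # dirty value
    {"zip": "75001", "city": "Paris"},
]
results = [cleaner.process(r) for r in stream]
```

Each record is handled once, on arrival, using only the per-class counters as state, which is the property that separates an incremental approach from re-cleaning a whole batch. In this sketch the third record is flagged as a violation and its `city` is rewritten to the majority value.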
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Data Stream Mining Techniques
