Approximate Stream Analytics in Apache Flink and Apache Spark Streaming
Do Le Quoc, Ruichuan Chen, Pramod Bhatotia, Christof Fetzer, Volker Hilt, Thorsten Strufe

TL;DR
This paper introduces StreamApprox, a system for approximate stream analytics in Apache Flink and Spark Streaming, using an online stratified reservoir sampling algorithm to improve efficiency while maintaining accuracy.
Contribution
The paper presents a generic sampling algorithm for stream processing systems, enabling approximate analytics with error bounds in both batched and pipelined frameworks.
Findings
Achieves a 1.15x-3x speedup over native Apache Flink and Spark Streaming.
Maintains accuracy comparable to exact processing.
Effective in real-world case studies.
Abstract
Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing - based on the chosen sample size - can make a systematic trade-off between the output accuracy and computation efficiency. Unfortunately, the state-of-the-art systems for approximate computing primarily target batch analytics, where the input data remains unchanged during the course of sampling. Thus, they are not well-suited for stream analytics. This motivated the design of StreamApprox - a stream analytics system for approximate computing. To realize this idea, we designed an online stratified reservoir sampling algorithm to produce approximate output with rigorous error bounds. Importantly, our proposed…
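The sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the StreamApprox implementation: the class name `StratifiedReservoir`, the fixed per-stratum `capacity`, and the `estimate_sum` helper are assumptions made here for illustration. Each sub-stream (stratum) keeps its own fixed-size reservoir, filled with classic reservoir sampling, and each sampled item is weighted by the ratio of items seen to items kept when estimating a sum.

```python
import random
from collections import defaultdict


class StratifiedReservoir:
    """Illustrative sketch: one fixed-size reservoir per stratum (sub-stream)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity              # max items kept per stratum (assumed fixed)
        self.rng = random.Random(seed)
        self.samples = defaultdict(list)      # stratum -> sampled items
        self.counts = defaultdict(int)        # stratum -> items seen so far

    def offer(self, stratum, item):
        """Process one arriving item using classic reservoir sampling."""
        self.counts[stratum] += 1
        seen = self.counts[stratum]
        reservoir = self.samples[stratum]
        if len(reservoir) < self.capacity:
            # Reservoir not yet full: keep every item.
            reservoir.append(item)
        else:
            # Keep the new item with probability capacity / seen,
            # evicting a uniformly chosen resident item.
            j = self.rng.randrange(seen)
            if j < self.capacity:
                reservoir[j] = item

    def estimate_sum(self):
        """Estimate the sum over all items seen, weighting each stratum's sample."""
        total = 0.0
        for stratum, reservoir in self.samples.items():
            # Weight is 1 while the stratum still fits in its reservoir.
            weight = self.counts[stratum] / len(reservoir)
            total += weight * sum(reservoir)
        return total


if __name__ == "__main__":
    sampler = StratifiedReservoir(capacity=100, seed=42)
    for i in range(10_000):
        # Two hypothetical sub-streams with very different arrival rates.
        stratum = "rare" if i % 100 == 0 else "frequent"
        sampler.offer(stratum, float(i % 7))
    print(sampler.estimate_sum())
```

Sampling each stratum separately keeps low-rate sub-streams from being crowded out by high-rate ones, which plain reservoir sampling over the merged stream would not guarantee. The sketch omits error-bound estimation.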
Taxonomy
Topics: Advanced Database Systems and Queries · Low-power high-performance VLSI design · Neural Networks and Reservoir Computing
