TL;DR
This paper introduces a benchmarking framework for distributed stream processing systems, evaluating throughput and latency of windowed operations on real-world workloads, providing detailed performance insights into Apache Storm, Spark, and Flink.
Contribution
It defines latency and throughput for stateful operators, separates system and driver for realistic testing, and creates the first comprehensive benchmarking framework for streaming systems.
Findings
Flink outperforms in throughput and latency for windowed operations.
System performance varies significantly across different workloads.
Benchmarking framework reveals each system's unique strengths and limitations.
Abstract
The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear gap of detailed analyses of the systems' performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate the performance of three widely used SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use-cases inspired by the online gaming industry. The contribution…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
