Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing
Ben Blamey, Andreas Hellander, Salman Toor

TL;DR
This paper benchmarks the throughput of Apache Spark Streaming and HarmonicIO for large message sizes and high CPU loads, revealing performance trade-offs and guiding framework selection for scientific and enterprise streaming applications.
Contribution
It provides a detailed performance comparison and analysis of Spark Streaming and HarmonicIO under various loads and message sizes, highlighting their strengths and limitations.
Findings
HarmonicIO offers more robust performance for large messages in the 1MB-10MB range.
Spark Streaming's rich features can lead to performance sensitivity with large message sizes.
Performance trade-offs depend on the streaming source and the load, and these trade-offs should guide the choice of framework.
Abstract
This paper presents a benchmark of stream processing throughput comparing Apache Spark Streaming (under file-, TCP socket- and Kafka-based stream integration) with a prototype P2P stream processing framework, HarmonicIO. Maximum throughput is measured for a spectrum of stream processing loads, specifically those with large message sizes (up to 10MB) and heavy CPU loads -- more typical of scientific computing use cases (such as microscopy) than enterprise contexts. A detailed exploration of the performance characteristics with these streaming sources, under varying loads, reveals an interplay of performance trade-offs, uncovering the boundaries of good performance for each framework and streaming source integration. We compare with theoretical bounds in each case. Based on these results, we suggest which frameworks and streaming sources are likely to offer good performance for a given…
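As a rough illustration of what a theoretical throughput bound could look like for such workloads, the sketch below takes the maximum message rate to be the smaller of a network limit and a CPU limit. This is an assumed simplification and not the paper's own derivation: the function name, its parameters, and the example figures (a 10 Gbit/s link, 8 cores) are hypothetical.

```python
import math


def max_message_rate(message_bytes, cpu_seconds_per_message,
                     link_bytes_per_second, total_cores):
    """Upper bound on messages per second (hypothetical, simplified model).

    Assumes throughput is limited by whichever is smaller:
      - the network: link bandwidth divided by message size
      - the CPU: available cores divided by CPU time per message
    Framework overheads (serialization, scheduling, replication) are ignored.
    """
    network_bound = link_bytes_per_second / message_bytes
    if cpu_seconds_per_message > 0:
        cpu_bound = total_cores / cpu_seconds_per_message
    else:
        cpu_bound = math.inf  # no CPU work per message: only the network limits
    return min(network_bound, cpu_bound)


# Hypothetical cluster: 10 Gbit/s link (1.25e9 bytes/s), 8 worker cores.
LINK = 1.25e9
CORES = 8

# Large messages with no CPU cost are network-bound.
network_limited = max_message_rate(10_000_000, 0.0, LINK, CORES)

# Smaller messages with 1 s of CPU work each are CPU-bound.
cpu_limited = max_message_rate(1_000_000, 1.0, LINK, CORES)
```

Under this model a measured rate can be compared against the bound for each combination of message size and CPU cost, which indicates whether the network, the CPU, or the framework itself is the limiting factor.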
Taxonomy
Topics: Data Stream Mining Techniques · Cloud Computing and Resource Management · Scientific Computing and Data Management
