Distributed Streaming Analytics on Large-scale Oceanographic Data using Apache Spark
Janak Dahal, Elias Ioup, Shaikh Arifuzzaman, Mahdi Abdelguerfi

TL;DR
This paper evaluates Apache Spark's streaming capabilities for large-scale oceanographic data, demonstrating its scalability, fault tolerance, and effectiveness in real-time geo-temporal data analysis and visualization.
Contribution
It provides a comprehensive assessment of Spark Streaming for large-scale geo-temporal data, including latency, scalability, fault tolerance, and a full-stack application for data processing and visualization.
Findings
Spark Streaming achieves low latency in large-scale data processing.
The system scales effectively with node addition/removal.
Fault tolerance ensures job completion despite node failures.
Abstract
Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark's streaming library is increasingly becoming a popular choice as it can stream and analyze a significant amount of data. In this paper, we analyze large-scale geo-temporal data collected from the USGODAE (United States Global Ocean Data Assimilation Experiment) data catalog, and showcase and assess the ability of Spark stream processing. We measure the latency of streaming and monitor scalability by adding and removing nodes in the middle of a streaming job. We also verify the fault tolerance by stopping nodes in the middle of a job and making sure that the job is rescheduled and completed on other nodes. We design a full-stack application that automates data collection, data processing and…
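To make the latency measurement described in the abstract concrete, here is a minimal plain-Python sketch of the micro-batch model that Spark Streaming is built on: incoming records are grouped into small batches, each batch is aggregated, and the processing time per batch is recorded. It does not use the Spark API and is not the authors' code. The function names, the one-degree grid cell, and the sample observations are all hypothetical, chosen only for illustration.

```python
import time
from collections import defaultdict
from statistics import mean


def micro_batches(records, batch_size):
    """Yield consecutive fixed-size batches, mimicking a discretized stream."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def mean_by_cell(batch, cell_deg=1.0):
    """Average the observed value per (lat, lon) grid cell within one batch."""
    cells = defaultdict(list)
    for lat, lon, value in batch:
        # Floor division assigns each point to the grid cell that contains it.
        key = (int(lat // cell_deg), int(lon // cell_deg))
        cells[key].append(value)
    return {key: mean(values) for key, values in cells.items()}


def run_stream(records, batch_size):
    """Process each batch and record its wall-clock processing latency."""
    results, latencies = [], []
    for batch in micro_batches(records, batch_size):
        started = time.perf_counter()
        results.append(mean_by_cell(batch))
        latencies.append(time.perf_counter() - started)
    return results, latencies


# Hypothetical observations: (latitude, longitude, temperature in Celsius).
observations = [
    (10.2, 120.7, 28.0),
    (10.8, 120.1, 30.0),
    (45.5, -30.3, 12.0),
    (45.1, -30.9, 14.0),
]
results, latencies = run_stream(observations, batch_size=2)
```

In a real deployment the batches would arrive over time from a live source and be processed in parallel across worker nodes. The sketch only shows how a per-batch latency figure can be obtained.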
Taxonomy
Topics: Scientific Computing and Data Management · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
