Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform
Nikolay Malitsky, Aashish Chaudhary, Sebastien Jourdain, Matt Cowan, Patrick O'Leary, Marcus Hanwell, and Kerstin Kleese Van Dam

TL;DR
This paper presents the Spark-MPI platform, which integrates Spark's data processing capabilities with the MPI high-performance computing framework to enable near-real-time data analysis in scientific experiments.
Contribution
It introduces an integrated platform combining Spark and MPI for real-time processing pipelines, demonstrated through ptychographic and tomographic reconstruction applications.
Findings
Enhanced data processing speed for real-time scientific analysis
Successful implementation of integrated Spark-MPI pipelines
Improved scalability and flexibility in experimental data workflows
Abstract
Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and the scale of computational tasks have produced a demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach, which connects the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decoupled various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, spreading from SQL queries to machine learning to graph processing. Spark-MPI further…
