Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform

Nikolay Malitsky; Aashish Chaudhary; Sebastien Jourdain; Matt Cowan; Patrick O'Leary; Marcus Hanwell; and Kerstin Kleese Van Dam

arXiv:1805.04886 · cs.DC · May 15, 2018

TL;DR

This paper presents the Spark-MPI platform that integrates Spark's data processing capabilities with MPI's high-performance computing to enable near-real-time data analysis in scientific experiments.

Contribution

It introduces an integrated platform combining Spark and MPI for real-time processing pipelines, demonstrated through ptychographic and tomographic reconstruction applications.

Findings

01. Enhanced data processing speed for real-time scientific analysis

02. Successful implementation of integrated Spark-MPI pipelines

03. Improved scalability and flexibility in experimental data workflows

Abstract

Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and in the scale of computational tasks have produced a demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach, which connects the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decoupled various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, ranging from SQL queries to machine learning to graph processing. Spark-MPI further…
