TL;DR
This paper compares Spark, Dask, RADICAL-Pilot, and MPI for analyzing large-scale molecular dynamics simulation data, focusing on their suitability, performance, and parallelization strategies for trajectory analysis tasks.
Contribution
It provides a comprehensive evaluation of task-parallel frameworks for MD data analysis, highlighting their capabilities and performance differences on HPC resources.
Findings
RADICAL-Pilot and MPI outperform Spark and Dask in certain tasks.
Framework choice significantly impacts analysis efficiency and scalability.
Parallelization approaches vary in effectiveness depending on the algorithm.
Abstract
Different parallel frameworks for implementing data analysis applications have been proposed by the HPC and Big Data communities. In this paper, we investigate three task-parallel frameworks, Spark, Dask and RADICAL-Pilot, with respect to their ability to support data analytics on HPC resources, and compare them with MPI. We investigate the data analysis requirements of Molecular Dynamics (MD) simulations, which are significant consumers of supercomputing cycles and produce immense amounts of data. A typical large-scale MD simulation of a physical system of O(100k) atoms over microseconds can produce from O(10) GB to O(1000) GB of data. We propose and evaluate different approaches for parallelization of a representative set of MD trajectory analysis algorithms, in particular the computation of path similarity and leaflet identification. We evaluate Spark, Dask and RADICAL-Pilot with respect…
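To make the task-parallel framing of path similarity concrete, here is a minimal Python sketch, not the paper's implementation. It assumes the Hausdorff distance as the similarity metric (the abstract does not name one) and uses the standard-library `ThreadPoolExecutor` as a stand-in for the task schedulers of Spark, Dask, RADICAL-Pilot, or MPI. The names `hausdorff` and `pairwise_similarity` and the toy trajectories are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import math


def hausdorff(p, q):
    """Symmetric Hausdorff distance between two trajectories.

    Each trajectory is a list of frames; each frame is a flat tuple of
    coordinates. The metric itself is an assumption made for this sketch.
    """
    def directed(x, y):
        # Largest distance from a frame in x to its nearest frame in y.
        return max(min(math.dist(f, g) for g in y) for f in x)

    return max(directed(p, q), directed(q, p))


def pairwise_similarity(trajectories, max_workers=4):
    """Fill a symmetric distance matrix, one independent task per pair."""
    n = len(trajectories)
    matrix = [[0.0] * n for _ in range(n)]

    def task(pair):
        i, j = pair
        return i, j, hausdorff(trajectories[i], trajectories[j])

    # Each unordered pair is an independent unit of work, which is what
    # makes the problem a natural fit for a task-parallel framework.
    pairs = combinations(range(n), 2)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for i, j, d in pool.map(task, pairs):
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy input: three 2-frame trajectories of a single 2D point (made-up data).
trajs = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
    [(0.0, 3.0), (1.0, 3.0)],
]
matrix = pairwise_similarity(trajs)
```

In this sketch every pair is its own task. With many trajectories, grouping pairs into larger blocks per task would reduce scheduling overhead, and that choice of task granularity is one place where different parallelization approaches can diverge.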
