TL;DR
This study compares the performance of Dask and Apache Spark on HPC systems for neuroimaging data processing. The two engines perform comparably; they differ mainly in memory usage, and data transfer time is the main limiting factor.
Contribution
It provides a systematic benchmark of Dask and Spark for neuroimaging pipelines on HPC, offering practical insights for selecting Big Data tools in scientific research.
Findings
Performance of Dask and Spark is comparable for data-intensive neuroimaging tasks.
Spark requires more memory, potentially affecting runtime.
Data transfer time is the main limiting factor in performance.
Abstract
The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GiB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while using varying combinations of the number of nodes, file size, and task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is…
