Performance comparison of Dask and Apache Spark on HPC systems for Neuroimaging

Mathieu Dugré; Valérie Hayot-Sasson; Tristan Glatard

arXiv:2406.01409 · cs.DC · June 4, 2024

TL;DR

This study compares Dask and Apache Spark on HPC systems for neuroimaging data processing. The two engines perform similarly, but Spark uses more memory, and data transfer time is the main bottleneck for both.

Contribution

It provides a systematic benchmark of Dask and Spark for neuroimaging pipelines on HPC, offering practical insights for selecting Big Data tools in scientific research.

Findings

01. Dask and Spark perform comparably on data-intensive neuroimaging tasks.

02. Spark requires more memory than Dask, which may affect runtime.

03. Data transfer time is the main limiting factor in performance.

Abstract

The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GiB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while varying the number of nodes, file size, and task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is…
