# A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines

**Authors:** Mathieu Dugré, Valérie Hayot-Sasson, Tristan Glatard

arXiv: 1907.13030 · 2019-10-08

## TL;DR

This study compares the runtime performance of Dask and Apache Spark on neuroimaging pipelines. The two engines perform comparably, and data transfer is the major limiting factor in all cases.

## Contribution

The paper provides a systematic benchmark of Dask (Bag, Delayed, and Futures) and PySpark (RDD API) for neuroimaging workflows, using two synthetic pipelines and one real pipeline on an 8-node cluster in Compute Canada's Arbutus cloud.

## Key findings

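- Despite slight differences, Spark and Dask perform comparably on the benchmarked neuroimaging pipelines.
- Data transfer was the major limiting factor in all cases.
- Dask pipelines risk being limited by Python's global interpreter lock (GIL), depending on task type and cluster configuration.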
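The GIL effect named in the last finding can be illustrated without either engine. The following is a minimal sketch using only Python's standard library, not the paper's benchmark code: the helpers `cpu_bound_task`, `run_tasks`, and `compare` are hypothetical names introduced here, and the thread and process pools merely stand in for thread-based and process-based workers.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound_task(n):
    """Pure-Python loop; on a GIL-enabled CPython build it needs the GIL to run."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_tasks(executor_cls, n_tasks, n_workers, work=200_000):
    """Run n_tasks copies of the task; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=n_workers) as pool:
        results = list(pool.map(cpu_bound_task, [work] * n_tasks))
    return results, time.perf_counter() - start


def compare(n_tasks=8, n_workers=4, work=2_000_000):
    """Print wall-clock time for thread-based vs. process-based workers."""
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        _, elapsed = run_tasks(executor_cls, n_tasks, n_workers, work)
        print(f"{executor_cls.__name__}: {elapsed:.2f} s")
```

Call `compare()` from under an `if __name__ == "__main__":` guard, since process pools may re-import the main module on some platforms. On a standard CPython build, the thread pool would typically show little speedup for this kind of task because only one thread executes Python bytecode at a time, whereas tasks that release the GIL (I/O, many NumPy operations) are less affected. Actual timings depend on the machine and are not taken from the paper.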

## Abstract

In the past few years, neuroimaging has entered the Big Data era due to the joint increase in image resolution, data sharing, and study sizes. However, no particular Big Data engines have emerged in this field, and several alternatives remain available. We compare two popular Big Data engines with Python APIs, Apache Spark and Dask, for their runtime performance in processing neuroimaging pipelines. Our evaluation uses two synthetic pipelines processing the 81GB BigBrain image, and a real pipeline processing anatomical data from more than 1,000 subjects. We benchmark these pipelines using various combinations of task durations, data sizes, and numbers of workers, deployed on an 8-node (8 cores ea.) compute cluster in Compute Canada's Arbutus cloud. We evaluate PySpark's RDD API against Dask's Bag, Delayed and Futures. Results show that despite slight differences between Spark and Dask, both engines perform comparably. However, Dask pipelines risk being limited by Python's GIL depending on task type and cluster configuration. In all cases, the major limiting factor was data transfer. While either engine is suitable for neuroimaging pipelines, more effort needs to be placed in reducing data transfer time.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.13030/full.md

## Figures

47 figures with captions in the complete paper: https://tomesphere.com/paper/1907.13030/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/1907.13030/full.md

---
Source: https://tomesphere.com/paper/1907.13030