ds-array: A Distributed Data Structure for Large Scale Machine Learning
Javier Álvarez Cid-Fuentes, Pol Álvarez, Salvi Solà, Kuninori Ishii, Rafael K. Morizawa, Rosa M. Badia

TL;DR
This paper introduces ds-array, a new distributed data structure for dislib that significantly improves performance, scalability, and usability for large-scale scientific machine learning tasks on HPC clusters.
Contribution
The paper presents ds-array, a novel distributed data structure that overcomes dislib's limitations by offering a NumPy-like API and enhanced efficiency.
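To make the idea of a NumPy-like interface over partitioned data concrete, the following is a minimal single-process sketch, assuming a 2-D array stored as a grid of blocks. It is not dislib's implementation or API: the class `BlockedArray` and its methods `from_numpy`, `transpose`, and `collect` are hypothetical names chosen for illustration, and the blocks are plain in-memory NumPy arrays rather than distributed objects.

```python
import numpy as np


class BlockedArray:
    """Hypothetical 2-D array stored as a grid of NumPy blocks (not dislib's API)."""

    def __init__(self, blocks, block_shape, shape):
        self.blocks = blocks            # list of block rows, each a list of np.ndarray
        self.block_shape = block_shape  # (rows per block, columns per block)
        self.shape = shape              # logical (rows, columns) of the whole array

    @classmethod
    def from_numpy(cls, x, block_shape):
        """Split a 2-D NumPy array into blocks; edge blocks may be smaller."""
        br, bc = block_shape
        blocks = [
            [x[i:i + br, j:j + bc] for j in range(0, x.shape[1], bc)]
            for i in range(0, x.shape[0], br)
        ]
        return cls(blocks, block_shape, x.shape)

    def __getitem__(self, key):
        """Element access with a pair of non-negative integers, as in x[i, j]."""
        i, j = key
        br, bc = self.block_shape
        # Locate the owning block, then index inside it.
        return self.blocks[i // br][j // bc][i % br, j % bc]

    def transpose(self):
        """Transpose by swapping the block grid and transposing each block."""
        n_rows, n_cols = len(self.blocks), len(self.blocks[0])
        t = [[self.blocks[i][j].T for i in range(n_rows)] for j in range(n_cols)]
        return BlockedArray(t, self.block_shape[::-1], self.shape[::-1])

    def collect(self):
        """Assemble all blocks into a single NumPy array."""
        return np.block(self.blocks)


x = np.arange(35).reshape(5, 7)
a = BlockedArray.from_numpy(x, block_shape=(2, 3))
```

In this sketch each operation touches blocks independently, which is the property a distributed version would rely on to run work in parallel; how ds-array actually schedules and stores its partitions is not described in this summary.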
Findings
Performance improvements of up to 100x over dislib's Dataset structure
Enhanced scalability and usability in scientific data analysis
Reduced computational complexity of key operations
Abstract
Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Often, data sets from these research areas need to be processed in distributed platforms due to their magnitude. This can be done using one of the various distributed machine learning libraries available. One of these libraries is dislib, a distributed machine learning library for Python especially designed to process large scale data sets on HPC clusters, which makes dislib an ideal candidate for analyzing scientific data. However, dislib's main distributed data structure, called Dataset, has some limitations, including poor performance in certain operations and low flexibility and usability. In this paper, we propose a novel distributed data structure for dislib, called ds-array, that addresses dislib's…
Taxonomy
Topics: Neural Networks and Applications · Algorithms and Data Compression · Advanced Data Storage Technologies
