Clairvoyant Prefetching for Distributed Machine Learning I/O
Nikoli Dryden, Roman Böhringer, Tal Ben-Nun, Torsten Hoefler

TL;DR
This paper introduces NoPFS, a middleware that uses clairvoyance (exact knowledge of future data accesses, derived from the random seed) to predict data access patterns in distributed machine learning, reducing I/O bottlenecks and shortening training times.
Contribution
NoPFS is a novel I/O middleware that leverages seed-based access pattern prediction and adaptive caching to optimize data ingestion in distributed ML training.
Findings
Reduces I/O times by up to 5.4x on large datasets.
Improves overall training speed and efficiency.
Adapts to different datasets and storage hierarchies.
Abstract
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing this I/O bottleneck necessitates careful optimization, as optimal data ingestion pipelines differ between systems, and require a delicate balance between access to local storage, external filesystems, and remote nodes. We introduce NoPFS, a machine learning I/O middleware, which provides a scalable, flexible, and easy-to-use solution to the I/O bottleneck. NoPFS uses clairvoyance: Given the seed generating the random access pattern for training with SGD, it can exactly predict when and where a sample will be accessed. We combine this with an analysis of access patterns and a performance model to provide distributed caching policies that adapt to different datasets and storage hierarchies. NoPFS reduces…
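The clairvoyance idea in the abstract can be illustrated with a minimal sketch. It is not the NoPFS implementation or API: the function names, the per-epoch seeding scheme, and the strided split of each shuffled epoch across workers are all assumptions made for illustration. The point is only that once the seed is fixed, every worker's future accesses can be computed ahead of time, and per-sample access counts can be derived from them to inform a caching decision.

```python
import random
from collections import Counter


def epoch_permutation(seed, epoch, num_samples):
    """Return the shuffled sample order for one epoch.

    Assumption: each epoch is seeded from (seed, epoch) so that any
    process can regenerate the same order without communication.
    """
    rng = random.Random(seed * 1_000_003 + epoch)  # hypothetical seeding scheme
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


def worker_accesses(seed, epoch, num_samples, num_workers, rank):
    """Samples that worker `rank` reads in `epoch`, in access order.

    Assumption: the shuffled order is split across workers by striding.
    """
    return epoch_permutation(seed, epoch, num_samples)[rank::num_workers]


def access_counts(seed, num_epochs, num_samples, num_workers, rank):
    """How often worker `rank` will read each sample over all epochs."""
    counts = Counter()
    for epoch in range(num_epochs):
        counts.update(worker_accesses(seed, epoch, num_samples, num_workers, rank))
    return counts


if __name__ == "__main__":
    # Every value here is known before training starts.
    print(worker_accesses(seed=42, epoch=0, num_samples=10, num_workers=4, rank=1))
    print(access_counts(seed=42, num_epochs=5, num_samples=10, num_workers=4, rank=0))
```

In a sketch like this, a worker could rank samples by how often it will read them and keep the most frequently accessed ones in its fastest local storage; the caching policies described in the paper additionally rely on a performance model of the storage hierarchy, which is not modeled here.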
Taxonomy
Methods: Stochastic Gradient Descent
