TL;DR
This paper introduces DELI, a system that enhances distributed deep learning performance in cloud environments by using caching and pre-fetching to efficiently access data stored in cloud storage buckets, reducing wait times and costs.
Contribution
The paper presents DELI, an approach that applies classical caching and pre-fetching techniques to optimize data access from cloud storage buckets during distributed training, mitigating the bandwidth limitations of those buckets.
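To make the caching and pre-fetching idea concrete, here is a minimal sketch of a loader that keeps recently used samples in a bounded cache and fetches upcoming samples in the background. It is not DELI's implementation: the names `PrefetchingCache`, `iterate`, `fetch_fn`, `capacity`, and `lookahead` are hypothetical, and `fetch_fn` stands in for whatever call downloads one object from a storage bucket.

```python
import threading
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingCache:
    """Bounded LRU cache with background pre-fetching (illustrative sketch only)."""

    def __init__(self, fetch_fn, capacity=64, workers=4):
        self._fetch_fn = fetch_fn      # hypothetical: key -> sample, e.g. a bucket download
        self._capacity = capacity      # maximum number of cached samples
        self._cache = OrderedDict()    # key -> sample, least recently used first
        self._pending = {}             # key -> Future for fetches started ahead of time
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self.sync_fetches = 0          # requests that had to block on a fresh download

    def prefetch(self, keys):
        """Start background fetches for keys that are neither cached nor in flight."""
        with self._lock:
            for key in keys:
                if key not in self._cache and key not in self._pending:
                    self._pending[key] = self._pool.submit(self._fetch_fn, key)

    def get(self, key):
        """Return the sample for key from cache, a pre-fetch, or a blocking fetch."""
        with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)   # mark as most recently used
                return self._cache[key]
            future = self._pending.pop(key, None)
            if future is None:
                self.sync_fetches += 1
        # Wait outside the lock so other threads are not blocked by the download.
        value = self._fetch_fn(key) if future is None else future.result()
        with self._lock:
            self._cache[key] = value
            self._cache.move_to_end(key)
            while len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # evict the least recently used sample
        return value


def iterate(keys, cache, lookahead=8):
    """Yield samples in order while pre-fetching the next `lookahead` keys."""
    for i, key in enumerate(keys):
        cache.prefetch(keys[i + 1 : i + 1 + lookahead])
        yield cache.get(key)
```

In this sketch only the first request blocks on a fresh download; later requests are served from the cache or from a fetch that was already started. How well the overlap hides latency depends on the lookahead depth, the number of worker threads, and the bandwidth actually available from the bucket.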
Findings
DELI reduces data loading wait time by up to 93.5%.
It achieves performance comparable to loading data from local disk.
It has the potential to lower training costs significantly.
Abstract
Cloud computing provides a powerful yet low-cost environment for distributed deep learning workloads. However, training complex deep learning models often requires accessing large amounts of data, which can easily exceed the capacity of local disks. Prior research often overlooks this training data problem by implicitly assuming that data is available locally or via low latency network-based data storage. Such implicit assumptions often do not hold in a cloud-based training environment, where deep learning practitioners create and tear down dedicated GPU clusters on demand, or do not have the luxury of local storage, such as in serverless workloads. In this work, we investigate the performance of distributed training that leverages training data residing entirely inside cloud storage buckets. These buckets promise low storage costs, but come with inherent bandwidth limitations that make…
