Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage

Nicholas Krichevsky (1), Renee St Louis (1), Tian Guo (1) ((1) Worcester Polytechnic Institute)

arXiv:2108.06322·cs.DC·November 24, 2021

TL;DR

This paper introduces DELI, a system that enhances distributed deep learning performance in cloud environments by using caching and pre-fetching to efficiently access data stored in cloud storage buckets, reducing wait times and costs.

Contribution

The paper presents DELI, which applies two classical techniques, caching and pre-fetching, to data access from cloud storage buckets during distributed training, addressing the buckets' bandwidth limitations.
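
To make the caching and pre-fetching idea concrete, here is a minimal sketch in Python. It is not DELI's implementation: the class `PrefetchingCache`, its parameters, and the stand-in `fake_bucket_read` function are all hypothetical names chosen for illustration, and a real bucket read would replace the stand-in. The sketch assumes a single consumer thread and a capacity of at least 1.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingCache:
    """Hypothetical LRU cache that fetches upcoming samples in the background.

    Illustrative only; assumes one consumer thread and capacity >= 1.
    """

    def __init__(self, fetch, capacity=64, workers=4):
        self._fetch = fetch            # slow call, e.g. a cloud bucket download
        self._capacity = capacity
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._entries = OrderedDict()  # key -> Future, least recently used first
        self.misses = 0                # get() calls that were not prefetched

    def _submit(self, key):
        # Reuse an in-flight or finished fetch if one exists.
        if key in self._entries:
            self._entries.move_to_end(key)
            return self._entries[key]
        future = self._pool.submit(self._fetch, key)
        self._entries[key] = future
        # Evict least recently used entries once over capacity.
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)
        return future

    def prefetch(self, keys):
        # Start background fetches for samples the trainer will need soon.
        for key in keys:
            self._submit(key)

    def get(self, key):
        # Block only if the sample has not finished downloading yet.
        if key not in self._entries:
            self.misses += 1
        return self._submit(key).result()

    def close(self):
        self._pool.shutdown(wait=True)


def fake_bucket_read(key):
    # Stand-in for a bucket read; returns deterministic bytes.
    return f"data-{key}".encode()


cache = PrefetchingCache(fake_bucket_read, capacity=8)
order = list(range(6))
cache.prefetch(order[:4])                  # warm the cache ahead of use
batch = [cache.get(k) for k in order]      # keys 4 and 5 were not prefetched
cache.close()
```

In this sketch the training loop would call `prefetch` with the next few sample keys while the current batch is being processed, so that `get` rarely has to wait on the network. How far ahead to fetch and how large to make the cache are tuning choices that the sketch leaves open.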

Findings

01. Data loading wait time reduced by up to 93.5%.

02. Achieves performance comparable to local disk data loading.

03. Potential to lower training costs significantly.

Abstract

Cloud computing provides a powerful yet low-cost environment for distributed deep learning workloads. However, training complex deep learning models often requires accessing large amounts of data, which can easily exceed the capacity of local disks. Prior research often overlooks this training data problem by implicitly assuming that data is available locally or via low latency network-based data storage. Such implicit assumptions often do not hold in a cloud-based training environment, where deep learning practitioners create and tear down dedicated GPU clusters on demand, or do not have the luxury of local storage, such as in serverless workloads. In this work, we investigate the performance of distributed training that leverages training data residing entirely inside cloud storage buckets. These buckets promise low storage costs, but come with inherent bandwidth limitations that make…

Code & Models

Repositories

cake-lab/deli (PyTorch, official)
