Analyzing and Mitigating Data Stalls in DNN Training
Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram

TL;DR
This paper analyzes how data pipeline bottlenecks impact DNN training times and introduces techniques to reduce data stalls, significantly improving training efficiency across various models and hardware configurations.
Contribution
It provides the first comprehensive analysis of data stalls in DNN training, develops a tool for precise measurement, and proposes effective mitigation techniques implemented in a new data-loading library.
Findings
Data stalls often dominate training time in DNNs.
The DS-Analyzer tool accurately measures data stalls.
CoorDL reduces training time by up to 5x compared to existing methods.
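To make the idea of a data stall concrete, here is a minimal sketch that splits one pass over a dataset into time spent waiting for the next batch and time spent in the training step. It is not DS-Analyzer's method, which this summary does not describe. The names `measure_data_stall`, `stall_fraction`, and the injectable `clock` parameter are hypothetical and chosen for illustration.

```python
import time
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def measure_data_stall(
    batches: Iterable[T],
    train_step: Callable[[T], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (stall_seconds, compute_seconds) for one pass over `batches`.

    Stall time is the time spent blocked waiting for the next batch.
    Assumption: `train_step` returns only when its work has finished.
    With asynchronous GPU execution, the step would need to synchronize
    before returning, or compute time would be undercounted.
    """
    stall = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(it)  # blocks while the input pipeline catches up
        except StopIteration:
            break
        t1 = clock()
        train_step(batch)
        t2 = clock()
        stall += t1 - t0
        compute += t2 - t1
    return stall, compute


def stall_fraction(stall: float, compute: float) -> float:
    """Fraction of the measured time that was spent waiting on data."""
    total = stall + compute
    return stall / total if total > 0 else 0.0
```

A fraction near zero suggests the step itself is the bottleneck, while a fraction near one suggests the input pipeline is. The `clock` parameter exists only so the function can be exercised with a deterministic fake clock.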
Abstract
Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely used computer vision and audio DNNs, which typically involve complex data pre-processing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation on servers that are part of a large production cluster at Microsoft. We find that in many cases, DNN training time is…
