Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale
Kashish Mittal, Di Yu, Roozbeh Ketabi, Arushi Arora, Brendon Lapp, Peng Zhang

TL;DR
This paper presents an optimized distributed data loading architecture that significantly improves GPU utilization, reduces training time, and enhances reproducibility for large-scale deep learning models.
Contribution
It introduces push-down worker transformations, local caching, and race condition elimination techniques to optimize data pipelines at scale.
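The combination of push-down worker transformations and local caching can be illustrated with a minimal sketch. It is not the authors' implementation: `LocalDiskCache` is a hypothetical standard-library stand-in for a disk cache such as Fanout-Cache, and `load_row_group`, `transform`, and `worker_process` are illustrative names. The idea shown is that each worker transforms its data once, stores the transformed result on local disk, and serves later epochs from that cache.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class LocalDiskCache:
    """Hypothetical stand-in for a local-disk cache (not the Fanout-Cache API)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.pkl"

    def get_or_compute(self, key, compute):
        path = self._path(key)
        if path.exists():
            # Cache hit: skip both the read and the transformation.
            with path.open("rb") as f:
                return pickle.load(f)
        value = compute()
        # Write to a temporary file first, then rename into place.
        tmp = path.with_suffix(".tmp")
        with tmp.open("wb") as f:
            pickle.dump(value, f)
        tmp.replace(path)
        return value


def load_row_group(row_group_id):
    """Hypothetical expensive read (network I/O in a real pipeline)."""
    return [{"id": row_group_id * 10 + i, "value": float(i)} for i in range(3)]


def transform(rows):
    """Hypothetical CPU-bound transformation, run inside the worker."""
    return [(r["id"], r["value"] * 2.0) for r in rows]


def worker_process(row_group_id, cache, stats):
    """Cache the *transformed* output so later epochs skip I/O and CPU work."""
    def compute():
        stats["misses"] += 1
        return transform(load_row_group(row_group_id))

    return cache.get_or_compute(f"row_group:{row_group_id}", compute)


cache = LocalDiskCache(tempfile.mkdtemp())
stats = {"misses": 0}
epoch_1 = [worker_process(i, cache, stats) for i in range(4)]
epoch_2 = [worker_process(i, cache, stats) for i in range(4)]
```

In this sketch the second epoch reads every row group from disk, so the miss counter stays at the number of distinct row groups.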
Findings
Achieved a 6x speedup in training time
Increased GPU utilization to over 60%, up from a profiled low of 10-15%
Reduced run-to-run variance, improving reproducibility
Abstract
Training massive-scale deep learning models on datasets spanning tens of terabytes presents critical challenges in hardware utilization and training reproducibility. In this paper, we identify and resolve profound data-loading bottlenecks within distributed GPU training pipelines using the Petastorm data loader and Apache Parquet datasets. Through systematic profiling, we demonstrate that network I/O and CPU-bound data transformations (e.g., PyArrow to NumPy) constrain GPU utilization to as low as 10-15%. To address this, we propose an optimized architecture that features push-down worker-level transformations coupled with local-disk caching via Fanout-Cache, minimizing redundant I/O and CPU overhead across training epochs. Furthermore, we eliminate race conditions in multi-worker shared queues by implementing dedicated round-robin ventilator and result queues, alongside modernized RNG…
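The dedicated round-robin ventilator and result queues mentioned in the abstract can be illustrated with a small sketch. It uses Python threads and `queue.Queue` rather than Petastorm's internals, and the names `worker` and `run_pipeline` are hypothetical. It assumes that each worker owns one input queue and one result queue, and that the consumer reads results in the same round-robin order used for dispatch, so output order does not depend on worker timing.

```python
import queue
import random
import threading
import time


def worker(in_q, out_q):
    """Process items from a dedicated input queue into a dedicated result queue."""
    while True:
        item = in_q.get()
        if item is None:  # sentinel: no more work for this worker
            break
        time.sleep(random.random() * 0.005)  # simulate variable processing time
        out_q.put(item * item)


def run_pipeline(items, num_workers=3):
    in_queues = [queue.Queue() for _ in range(num_workers)]
    out_queues = [queue.Queue() for _ in range(num_workers)]
    threads = [
        threading.Thread(target=worker, args=(in_queues[i], out_queues[i]))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    # Ventilate work round-robin: item k goes to worker k % num_workers.
    for idx, item in enumerate(items):
        in_queues[idx % num_workers].put(item)
    for q in in_queues:
        q.put(None)

    # Collect in the same round-robin order, blocking on the owning worker's
    # queue, so a fast worker can never overtake a slow one.
    results = [out_queues[idx % num_workers].get() for idx in range(len(items))]

    for t in threads:
        t.join()
    return results


first_run = run_pipeline(list(range(10)))
second_run = run_pipeline(list(range(10)))
```

With a single shared result queue, the order of results would depend on which worker finished first; here both runs return items in dispatch order despite the random delays.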
