Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale

Kashish Mittal; Di Yu; Roozbeh Ketabi; Arushi Arora; Brendon Lapp; Peng Zhang

arXiv:2604.21275·cs.DC·April 24, 2026

TL;DR

This paper presents an optimized distributed data loading architecture for large-scale deep learning that raises GPU utilization from as low as 10-15% to over 60%, delivers a 6x speedup in training time, and reduces run-to-run variance for better reproducibility.

Contribution

It introduces push-down worker-level transformations, local-disk caching via Fanout-Cache, and dedicated round-robin ventilator and result queues that eliminate race conditions in multi-worker shared queues.
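
As a rough illustration of how pushing a transformation into the worker and caching its output on local disk can remove repeated I/O and CPU work across epochs, here is a minimal sketch. It is not the paper's implementation and does not use the Petastorm or Fanout-Cache APIs: `LocalDiskCache`, `make_worker`, `read_row_group`, and `transform` are hypothetical names, and the cache is a small standard-library stand-in.

```python
import hashlib
import os
import pickle
import tempfile


class LocalDiskCache:
    """Hypothetical stand-in for a local-disk cache (not the Fanout-Cache API)."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so any string maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".pkl")

    def get_or_compute(self, key, compute):
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        value = compute()
        # Write to a temporary file, then rename, so a concurrent reader
        # never observes a partially written cache entry.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f)
        os.replace(tmp_path, path)
        return value


def make_worker(cache, read_row_group, transform):
    """Build a worker that caches the *transformed* row group.

    Because the transform runs inside the worker before caching, later
    epochs skip both the remote read and the CPU-bound conversion.
    """
    calls = {"reads": 0}

    def worker(row_group_id):
        def compute():
            calls["reads"] += 1
            return transform(read_row_group(row_group_id))

        return cache.get_or_compute(row_group_id, compute)

    worker.calls = calls  # exposed only so the sketch can be inspected
    return worker
```

Calling the worker twice with the same row-group identifier should trigger only one read and one transformation; the second call is served from local disk.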

Findings

01. Achieved a 6x speedup in training time
02. Increased GPU utilization to over 60%
03. Reduced run-to-run variance for reproducibility

Abstract

Training massive-scale deep learning models on datasets spanning tens of terabytes presents critical challenges in hardware utilization and training reproducibility. In this paper, we identify and resolve profound data-loading bottlenecks within distributed GPU training pipelines using the Petastorm data loader and Apache Parquet datasets. Through systematic profiling, we demonstrate that network I/O and CPU-bound data transformations (e.g., PyArrow to NumPy) constrain GPU utilization to as low as 10-15%. To address this, we propose an optimized architecture that features push-down worker-level transformations coupled with local-disk caching via Fanout-Cache, minimizing redundant I/O and CPU overhead across training epochs. Furthermore, we eliminate race conditions in multi-worker shared queues by implementing dedicated round-robin ventilator and result queues, alongside modernized RNG…
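
The idea of dedicated round-robin ventilator and result queues can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code or Petastorm's internals: `run_round_robin` and `process` are hypothetical names, and threads stand in for the loader's workers. Each worker gets its own input and output queue; work item `i` is sent to worker `i % num_workers`, and results are read back in the same rotation, so the output order does not depend on which worker finishes first.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def run_round_robin(items, process, num_workers=4):
    """Process items with per-worker queues and return results in input order."""
    items = list(items)
    in_queues = [queue.Queue() for _ in range(num_workers)]
    out_queues = [queue.Queue() for _ in range(num_workers)]

    def worker(idx):
        # Each worker reads only its own input queue and writes only its
        # own output queue, so workers never contend on a shared queue.
        while True:
            item = in_queues[idx].get()
            if item is _STOP:
                break
            out_queues[idx].put(process(item))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    # Ventilate: item i goes to worker i % num_workers.
    for i, item in enumerate(items):
        in_queues[i % num_workers].put(item)
    for q in in_queues:
        q.put(_STOP)

    # Collect in the same rotation. Each per-worker queue is FIFO, so the
    # k-th read from worker w is the result of the k-th item sent to w.
    results = [out_queues[i % num_workers].get() for i in range(len(items))]

    for t in threads:
        t.join()
    return results
```

With a single shared result queue, the order of results would depend on thread scheduling; with the per-worker rotation above, repeated runs return results in the same order even when individual items take varying amounts of time.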
