Stochastic Gradient Descent without Full Data Shuffle
Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang

TL;DR
This paper introduces CorgiPile, a hierarchical data shuffling strategy for SGD that reduces data shuffling overhead while maintaining convergence rates, significantly improving training speed in ML systems.
Contribution
The paper proposes CorgiPile, a novel data shuffling method that avoids full data shuffles, with theoretical analysis and practical integration into PyTorch and PostgreSQL.
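To make the idea of a hierarchical shuffle concrete, here is a minimal Python sketch. The text above only says the strategy is hierarchical and avoids a full shuffle; the specific two levels used here (randomize the order of storage blocks, then shuffle the tuples of a few blocks inside an in-memory buffer) are an assumption for illustration, not the authors' implementation. The function name `hierarchical_shuffle` and the parameter `buffer_blocks` are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def hierarchical_shuffle(
    blocks: Sequence[Sequence[T]],
    buffer_blocks: int,
    seed: int = 0,
) -> Iterator[T]:
    """Yield every item once, using a two-level shuffle (illustrative only).

    blocks        -- data already laid out in contiguous blocks on storage
    buffer_blocks -- hypothetical knob: how many blocks fit in the buffer
    """
    if buffer_blocks < 1:
        raise ValueError("buffer_blocks must be at least 1")

    rng = random.Random(seed)

    # Level 1: randomize the order in which blocks are visited.
    order = list(range(len(blocks)))
    rng.shuffle(order)

    for start in range(0, len(order), buffer_blocks):
        # Each block is read in full, which keeps storage access sequential.
        buffer: List[T] = []
        for block_id in order[start:start + buffer_blocks]:
            buffer.extend(blocks[block_id])

        # Level 2: shuffle the items of the buffered blocks in memory.
        rng.shuffle(buffer)
        yield from buffer


if __name__ == "__main__":
    # Six blocks of four items each; two blocks are buffered at a time.
    data = [list(range(i * 4, i * 4 + 4)) for i in range(6)]
    print(list(hierarchical_shuffle(data, buffer_blocks=2, seed=1)))
```

In this sketch a larger `buffer_blocks` moves the output closer to a full shuffle at the cost of memory, while `buffer_blocks=1` only mixes items within a single block.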
Findings
CorgiPile achieves convergence comparable to that of SGD with a full data shuffle.
CorgiPile accelerates deep learning training on ImageNet by 1.5X.
CorgiPile improves in-DB ML training speed by up to 12.8X.
Abstract
Stochastic gradient descent (SGD) is the cornerstone of modern machine learning (ML) systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., TensorFlow/PyTorch and in-DB ML systems over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study on existing data shuffling strategies, which reveals that all existing strategies have room for improvement -- they all suffer in terms of I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy,…
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Brain Tumor Detection and Classification
Methods: Non Maximum Suppression · 1x1 Convolution · Convolution · SSD · Stochastic Gradient Descent
