Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

Etay Livne; Gal Kaplun; Eran Malach; Shai Shalev-Schwatz

arXiv:2309.01640 · cs.LG · September 6, 2023

Open Access

TL;DR

This paper introduces a hybrid offline-online data shuffling method for SGD that balances efficiency and randomness, improving training on large cloud-stored datasets.

Contribution

It proposes a novel two-step shuffling strategy combining offline and online methods, enhancing data access efficiency while maintaining convergence performance.

Findings

01. Achieves similar convergence to fully random shuffling.

02. Reduces data access costs for large datasets.

03. Demonstrates practical benefits through experiments.

Abstract

When using Stochastic Gradient Descent (SGD) for training machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient. A recent work \cite{corgi} proposed an online shuffling algorithm called CorgiPile, which greatly improves the efficiency of data access at the cost of some performance loss. This loss is particularly apparent for large datasets stored in homogeneous shards (e.g., video datasets). In this paper, we introduce a novel two-step partial data shuffling strategy for SGD which combines an offline iteration of the CorgiPile method with a subsequent online iteration. Our approach enjoys the best of both worlds: it performs similarly to SGD with random access (even for homogeneous data) without compromising the data…
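
The two-step idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`corgipile_pass`, `offline_reshard`, `hybrid_epoch`), the in-memory list-of-lists representation of shards, and the choice to write each shuffled buffer back as equally sized shards are all assumptions made for illustration. The sketch assumes a CorgiPile-style pass means loading a few randomly chosen shards into a buffer and shuffling within that buffer.

```python
import random


def corgipile_pass(shards, buffer_shards, rng):
    """One CorgiPile-style pass (assumed form): visit shards in random order,
    load `buffer_shards` of them at a time, and shuffle each buffer."""
    order = list(range(len(shards)))
    rng.shuffle(order)
    for start in range(0, len(order), buffer_shards):
        chosen = order[start:start + buffer_shards]
        buffer = [example for i in chosen for example in shards[i]]
        rng.shuffle(buffer)
        yield buffer


def offline_reshard(shards, buffer_shards, rng):
    """Offline step (hypothetical helper): run one pass ahead of training and
    store every shuffled buffer as new shards of the original shard size."""
    shard_size = len(shards[0])  # assumes all shards have the same size
    new_shards = []
    for buffer in corgipile_pass(shards, buffer_shards, rng):
        for j in range(0, len(buffer), shard_size):
            new_shards.append(buffer[j:j + shard_size])
    return new_shards


def hybrid_epoch(mixed_shards, buffer_shards, rng):
    """Online step: a second pass over the already-mixed shards, yielding
    examples one at a time for SGD."""
    for buffer in corgipile_pass(mixed_shards, buffer_shards, rng):
        yield from buffer


if __name__ == "__main__":
    rng = random.Random(0)
    # Homogeneous shards: every example in shard s carries label s.
    shards = [[(s, k) for k in range(4)] for s in range(6)]
    mixed = offline_reshard(shards, buffer_shards=3, rng=rng)
    labels = [label for label, _ in hybrid_epoch(mixed, buffer_shards=3, rng=rng)]
    print(labels)
```

In this sketch, each online buffer draws from shards that were already mixed offline, so a buffer can contain examples from more original shards than a single online pass over the homogeneous shards would reach. Both steps read whole shards sequentially and never access individual examples at random.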

Taxonomy

Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques

Methods: Stochastic Gradient Descent