Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning
Patrik Okanovic, Roger Waleffe, Vasilis Mageirakos, Konstantinos E. Nikolakakis, Amin Karbasi, Dionysis Kalogerias, Nezihe Merve Gürel, Theodoros Rekatsinas

TL;DR
This paper introduces RS2, a simple random sampling method that significantly reduces training time-to-accuracy for neural networks by sampling different data subsets each epoch, outperforming many existing data pruning and distillation techniques.
Contribution
The paper proposes RS2, a novel repeated random sampling strategy, demonstrating its effectiveness in reducing training time and improving accuracy over state-of-the-art methods on large datasets like ImageNet.
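The core idea, drawing a different random subset of the training data for every epoch, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `rs2_epoch_subsets`, its parameters, and the example sizes (1000 examples, a 10% subset, 5 epochs) are all assumptions chosen for the sketch.

```python
import random


def rs2_epoch_subsets(num_examples, ratio, num_epochs, seed=0):
    """Yield one freshly sampled index subset per epoch.

    Hypothetical helper: each epoch draws `ratio * num_examples` distinct
    indices uniformly at random from the full dataset, independently of
    what earlier epochs drew.
    """
    rng = random.Random(seed)
    subset_size = max(1, round(ratio * num_examples))
    for _ in range(num_epochs):
        # Distinct indices within an epoch; epochs may overlap with each other.
        yield rng.sample(range(num_examples), subset_size)


# Assumed toy setting: 1000 examples, 10% of the data per epoch, 5 epochs.
subsets = list(rs2_epoch_subsets(num_examples=1000, ratio=0.1, num_epochs=5))
for epoch, subset in enumerate(subsets):
    # A real training loop would build this epoch's mini-batches from `subset`.
    print(epoch, len(subset), subset[:5])
```

Unlike a static pruned subset, nothing is selected before training starts, so the only per-epoch overhead in this sketch is the cost of drawing the indices.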
Findings
RS2 reduces time-to-accuracy by up to 7x on ImageNet.
RS2 achieves up to 29% accuracy improvement in high-compression regimes.
RS2 outperforms many existing data pruning and distillation methods.
Abstract
Methods for carefully selecting or generating a small set of training data to learn from, i.e., data pruning, coreset selection, and data distillation, have been shown to be effective in reducing the ever-increasing cost of training neural networks. Behind this success are rigorously designed strategies for identifying informative training examples out of large datasets. However, these strategies come with additional computational costs associated with subset selection or data distillation before training begins, and furthermore, many are shown to even under-perform random sampling in high data compression regimes. As such, many data pruning, coreset selection, or distillation methods may not reduce 'time-to-accuracy', which has become a critical efficiency measure of training deep neural networks over large datasets. In this work, we revisit a powerful yet overlooked random sampling…
Peer Reviews
Decision: ICLR 2024 poster
+ The highlight of an overlooked baseline in the context of dataset pruning/distillation.
+ Intensive experiments over so many baselines. This provides a very good benchmark and starting point for the following works, which I find really appreciable.
- The RS2 without replacement is exactly the same as reducing the number of training epochs but with tuned learning rate scheduling. The new term is not helping to make the concept clear but more confusing. This also means that the theoretical analysis in Section 4 did not make actual contributions over previous work.
- In my opinion, a type of dataset pruning methods, which generate static subsets before real training starts, are up to a slightly different point from RS2. While we all know tha
1. The paper presents a simple but novel approach to achieve significant reductions in time-to-accuracy while training on a fraction of the full dataset per epoch of model training.
2. The paper also presents detailed theoretical properties that support the faster convergence of the model as compared to existing approaches in the domain.
3. The paper demonstrates results on four image datasets including large scale image benchmarks like ImageNet wherein it achieves State-of-the-Art (SoTA) perfor
1. Although the experimental results are exemplary (primary contributor to my decision), the method RS2 itself is an incremental update over random sampling. The paper must call out the clear difference with SoTA methods (please refer to questions for more details).
2. All experiments demonstrated in the paper adopt canonical benchmarks which are well curated, while lacking experiments on datasets (e.g., MedMNIST (Yang et al., 2021), CUBS-2011 (Wah et al., 2011)) with large intra-class variance an
- The authors point out important considerations missing in prior work on speeding up training with adaptive dataset subset selection. First, whether there is a need to restrict data to a fixed subset in the first place if similar accuracy can be achieved by training with a compressed learning rate schedule on fewer epochs. Second, the importance of including overhead associated with data selection when evaluating training compute efficiency of an approach.
- As I understand, RS2 without replacement is effectively the same as training on the full dataset with the learning rate schedule compressed into fewer epochs. RS2 with replacement is a slight variant to that but still highly resembles standard training with shuffling between epochs just with a condensed training window. This is not discussed anywhere but brings into question the whole exposition of proposing RS2 as a sampling method. An even simpler baseline is training as usual on the full
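The reviewers' observation about RS2 without replacement can be made concrete with a small sketch. It assumes the without-replacement variant shuffles the full dataset once, hands out consecutive chunks as per-epoch subsets, and reshuffles when the pool runs out; that reading, the function name `rs2_without_replacement`, and the toy sizes are assumptions made for illustration, not details taken from the text above.

```python
import random


def rs2_without_replacement(num_examples, ratio, num_epochs, seed=0):
    """Yield per-epoch subsets drawn from a shuffled pool without replacement.

    Hypothetical helper: the pool is a shuffled copy of all indices and is
    refilled (reshuffled) only once it has been exhausted.
    """
    rng = random.Random(seed)
    subset_size = max(1, round(ratio * num_examples))
    pool = []
    for _ in range(num_epochs):
        subset = []
        while len(subset) < subset_size:
            if not pool:
                pool = list(range(num_examples))
                rng.shuffle(pool)
            take = subset_size - len(subset)
            subset.extend(pool[:take])
            del pool[:take]
        # Note: if subset_size does not divide num_examples, a subset that
        # spans a reshuffle boundary may contain a repeated index.
        yield subset


# Assumed toy setting: a 10% subset per epoch for 10 epochs.
epochs = list(rs2_without_replacement(num_examples=1000, ratio=0.1, num_epochs=10))
seen = sorted(i for subset in epochs for i in subset)
print(seen == list(range(1000)))  # every example appears exactly once
```

Under these assumptions, ten reduced epochs at a 10% ratio touch every example exactly once, which is the same data exposure as a single shuffled pass over the full dataset. That is the sense in which the variant resembles standard training with fewer epochs and a compressed learning rate schedule.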
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
Methods: Pruning · Test
