How much pre-training is enough to discover a good subnetwork?
Cameron R. Wolfe, Fangshuo Liao, Qihan Wang, Junhyung Lyle Kim, Anastasios Kyrillidis

TL;DR
This paper provides a theoretical analysis of the amount of pre-training needed for neural network pruning to produce high-performing subnetworks, supported by empirical validation on MNIST.
Contribution
It introduces a theoretical bound on the number of pre-training iterations needed for pruning to be effective, and links that number to dataset size.
Findings
The pre-training threshold depends logarithmically on dataset size.
A theoretical bound on pre-training iterations for pruning effectiveness.
Empirical validation on MNIST confirms theoretical predictions.
Abstract
Neural network pruning is useful for discovering efficient, high-performing subnetworks within pre-trained, dense network architectures. More often than not, it involves a three-step process -- pre-training, pruning, and re-training -- that is computationally expensive, as the dense model must be fully pre-trained. While previous work has revealed through experiments the relationship between the amount of pre-training and the performance of the pruned network, a theoretical characterization of such dependency is still missing. Aiming to mathematically analyze the amount of dense network pre-training needed for a pruned network to perform well, we discover a simple theoretical bound in the number of gradient descent pre-training iterations on a two-layer, fully-connected network, beyond which pruning via greedy forward selection [61] yields a subnetwork that achieves good training error.…
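As a rough illustration of the pruning step the abstract refers to, the sketch below implements one common reading of greedy forward selection on a two-layer network: hidden neurons are added one at a time (repeats allowed), each chosen to minimize the training loss of the subnetwork formed by averaging the selected neurons' outputs. The tanh activation, the squared-error loss, the toy data, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def neuron_outputs(weights, inputs):
    """Return outputs[i][j]: output of hidden neuron i on example j.

    Each neuron is a pair (a, b): input weights `a` and output weight `b`.
    The tanh activation is an assumption made for this sketch.
    """
    outs = []
    for a, b in weights:
        row = [b * math.tanh(sum(ak * xk for ak, xk in zip(a, x))) for x in inputs]
        outs.append(row)
    return outs


def mse(pred, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, targets)) / len(targets)


def greedy_forward_selection(outputs, targets, k):
    """Select k neurons greedily, allowing repeats.

    At each step, add the neuron whose inclusion gives the lowest MSE for
    the average of all selected neurons' outputs. Ties go to the lowest index.
    Returns the selected indices and the loss after each step.
    """
    running_sum = [0.0] * len(targets)
    selected, losses = [], []
    for step in range(1, k + 1):
        best_i, best_loss = None, float("inf")
        for i, out in enumerate(outputs):
            pred = [(s + o) / step for s, o in zip(running_sum, out)]
            loss = mse(pred, targets)
            if loss < best_loss:
                best_i, best_loss = i, loss
        selected.append(best_i)
        running_sum = [s + o for s, o in zip(running_sum, outputs[best_i])]
        losses.append(best_loss)
    return selected, losses


# Toy usage: a random (untrained) 50-neuron network on synthetic data.
random.seed(0)
inputs = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
targets = [math.sin(x[0]) for x in inputs]
weights = [([random.gauss(0, 1) for _ in range(3)], random.gauss(0, 1)) for _ in range(50)]
selected, losses = greedy_forward_selection(neuron_outputs(weights, inputs), targets, k=5)
print(selected, losses)
```

In the setting the abstract describes, the dense network would first be pre-trained with gradient descent before this selection step; the toy usage skips pre-training and only shows the selection mechanics.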
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and ELM · Domain Adaptation and Few-Shot Learning
Methods: Pruning · Stochastic Gradient Descent
