Top-KAST: Top-K Always Sparse Training
Siddhant M. Jayakumar, Razvan Pascanu, Jack W. Rae, Simon Osindero, Erich Elsen

TL;DR
Top-KAST is a sparse training method that keeps sparsity constant throughout training, in both the forward and backward passes. It performs comparably to or better than previous sparse training methods on ImageNet and is also demonstrated on language modeling tasks.
Contribution
The paper presents a simple sparse training approach that never instantiates dense parameters or dense gradients during training, which makes training very large models more scalable and resource-efficient.
Findings
Performs comparably to or better than previous methods on ImageNet while fully maintaining sparsity
Enables training of large language models with fewer resources
Easy to implement in existing frameworks
Abstract
Sparse neural networks are becoming increasingly important as the field seeks to improve the performance of existing models by scaling them up, while simultaneously trying to reduce power consumption and computational footprint. Unfortunately, most existing methods for inducing performant sparse models still entail the instantiation of dense parameters, or dense gradients in the backward-pass, during training. For very large models this requirement can be prohibitive. In this work we propose Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward-passes). We demonstrate the efficacy of our approach by showing that it performs comparably to or better than previous works when training models on the established ImageNet benchmark, whilst fully maintaining sparsity. In addition to our ImageNet results, we also demonstrate our approach in the…
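The constant-sparsity idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes, as the method's name suggests, that the active weights are chosen by top-K magnitude, and that the backward pass updates a somewhat larger top-K set that contains the forward set. The function names, the densities, and the toy linear model are all hypothetical choices made for illustration.

```python
def top_k_mask(weights, density):
    """Return a 0/1 mask keeping the `density` fraction of largest-magnitude weights."""
    k = max(1, round(density * len(weights)))
    # Sort indices by descending magnitude; ties are broken by index for determinism.
    order = sorted(range(len(weights)), key=lambda i: (-abs(weights[i]), i))
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(weights))]


def forward(weights, mask, x):
    """Toy linear model that only uses the weights selected by `mask`."""
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))


def train_step(weights, x, target, fwd_density, bwd_density, lr):
    """One SGD step on 0.5 * (prediction - target)**2 with sparse forward and backward sets.

    Assumption: bwd_density >= fwd_density, so the backward set contains the forward set.
    """
    fwd = top_k_mask(weights, fwd_density)
    bwd = top_k_mask(weights, bwd_density)
    err = forward(weights, fwd, x) - target
    # The gradient w.r.t. each active weight is err * x_i; weights outside the
    # backward set receive no update, so no dense gradient is ever applied.
    new_weights = [w - lr * err * xi * b for w, xi, b in zip(weights, x, bwd)]
    return new_weights, fwd, bwd


weights = [0.5, -2.0, 0.1, 1.5, -0.3, 0.05, 0.9, -1.1]
x = [1.0] * len(weights)
new_weights, fwd, bwd = train_step(weights, x, target=0.0,
                                   fwd_density=0.25, bwd_density=0.5, lr=0.1)
print(fwd, bwd)
```

Because the masks are recomputed from the current magnitudes at every step, a weight in the backward set can grow enough to enter the forward set later, while the number of active weights in each pass stays fixed.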
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
