Navigating Extremes: Dynamic Sparsity in Large Output Spaces
Nasib Ullah, Erik Schultheis, Mike Lasby, Yani Ioannou, and Rohit Babbar

TL;DR
This paper explores the use of dynamic sparse training (DST) for large output classification tasks, demonstrating how to maintain efficiency and performance with millions of labels on standard hardware.
Contribution
The paper introduces a method for applying DST effectively to large output spaces by addressing gradient flow issues, which enables end-to-end training with massive label sets.
Findings
DST can be applied to large classification tasks with millions of labels.
Using an intermediate layer or auxiliary objectives improves performance.
The approach enables training on commodity hardware with large label spaces.
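To make the memory argument behind these findings concrete, the following back-of-the-envelope sketch compares the storage needed for a dense classification layer with that of a sparse layer that keeps a fixed number of weights per label. The label count, hidden size, fan-in, and data types are illustrative assumptions, not figures taken from the paper, and the helper names are hypothetical.

```python
# Rough memory estimate for the output layer alone (weights only, no
# optimizer state or activations). All sizes below are assumed for
# illustration; they are not reported in the paper.

def dense_layer_bytes(num_labels: int, hidden_dim: int, bytes_per_weight: int = 4) -> int:
    """Bytes needed to store a dense (num_labels x hidden_dim) weight matrix."""
    return num_labels * hidden_dim * bytes_per_weight


def fixed_fan_in_bytes(num_labels: int, fan_in: int,
                       bytes_per_weight: int = 4, bytes_per_index: int = 4) -> int:
    """Bytes for a sparse layer storing `fan_in` (value, index) pairs per label."""
    return num_labels * fan_in * (bytes_per_weight + bytes_per_index)


num_labels = 3_000_000   # assumed label count
hidden_dim = 768         # assumed encoder output size
fan_in = 32              # assumed number of non-zero weights per label

dense = dense_layer_bytes(num_labels, hidden_dim)
sparse = fixed_fan_in_bytes(num_labels, fan_in)

print(f"dense:  {dense / 1e9:.2f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
print(f"ratio:  {dense / sparse:.1f}x")
```

Under these assumptions the dense layer needs several gigabytes for its weights alone, while the sparse layer stays well under one gigabyte. Gradients and optimizer state scale the same way, so the gap widens further during training.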
Abstract
In recent years, Dynamic Sparse Training (DST) has emerged as an alternative to post-training pruning for generating efficient models. In principle, DST allows for a more memory efficient training process, as it maintains sparsity throughout the entire training run. However, current DST implementations fail to capitalize on this in practice. Because sparse matrix multiplication is much less efficient than dense matrix multiplication on GPUs, most implementations simulate sparsity by masking weights. In this paper, we leverage recent advances in semi-structured sparse training to apply DST in the domain of classification with large output spaces, where memory-efficiency is paramount. With a label space of possibly millions of candidates, the classification layer alone will consume several gigabytes of memory. Switching from a dense to a fixed fan-in sparse layer updated with sparse…
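The fixed fan-in layer mentioned in the abstract can be pictured with a minimal NumPy sketch: every output unit (label) keeps exactly k input connections, stored as one index array and one value array of shape (num_labels, k). This is an illustration of the data layout only, assuming a simple gather-and-sum forward pass; it is not the authors' implementation, and the function names are hypothetical. The check at the end shows that the result matches a dense matrix with the same non-zero entries, which is what a masked dense layer would compute while still paying the full dense memory cost.

```python
import numpy as np


def fixed_fan_in_forward(x, indices, values):
    """Forward pass of a fixed fan-in sparse layer.

    x:       (batch, d_in) input features
    indices: (n_out, k) input positions each output unit is connected to
    values:  (n_out, k) weights of those connections
    returns: (batch, n_out) logits
    """
    gathered = x[:, indices]                       # (batch, n_out, k)
    return np.einsum("bok,ok->bo", gathered, values)


def to_dense(indices, values, d_in):
    """Expand the sparse representation into a dense (n_out, d_in) matrix."""
    n_out, k = indices.shape
    w = np.zeros((n_out, d_in), dtype=values.dtype)
    rows = np.repeat(np.arange(n_out), k)
    w[rows, indices.ravel()] = values.ravel()
    return w


rng = np.random.default_rng(0)
d_in, n_out, k, batch = 16, 50, 4, 3               # toy sizes, chosen arbitrarily

# Each row samples k distinct input positions, so no connection is duplicated.
indices = np.stack([rng.choice(d_in, size=k, replace=False) for _ in range(n_out)])
values = rng.standard_normal((n_out, k)).astype(np.float32)
x = rng.standard_normal((batch, d_in)).astype(np.float32)

sparse_logits = fixed_fan_in_forward(x, indices, values)
dense_logits = x @ to_dense(indices, values, d_in).T

print(np.allclose(sparse_logits, dense_logits, atol=1e-4))
```

Because every row has the same number of non-zeros, the weights fit in two regular rectangular arrays rather than an irregular sparse format. That regularity is what makes this kind of semi-structured sparsity easier to handle on GPUs than unstructured sparsity.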
Taxonomy
Topics: Computational Physics and Python Applications
Methods: Pruning · Dynamic Sparse Training
