Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs
Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin

TL;DR
This paper introduces a hierarchical N:M sparsity technique for neural networks, along with a specialized channel permutation method and GPU kernel, to improve accuracy and efficiency in sparse model compression on GPUs.
Contribution
It proposes gyro-permutation, a novel channel permutation strategy tailored for hierarchical N:M sparsity, and develops a GPU kernel to support layer permutation during sparse network execution.
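To see why channel order matters at all, here is a minimal sketch of plain N:M pruning (keep the N largest-magnitude weights in every group of M consecutive weights) applied to one row under two different channel orders. This is not gyro-permutation, whose details are not given in this summary. The helper `retained_magnitude`, the example row, and the two orderings are all hypothetical and chosen only for illustration.

```python
def retained_magnitude(row, order, n=2, m=4):
    """Sum of |w| that survives N:M pruning after reordering `row` by `order`.

    Hypothetical helper for illustration; assumes len(row) is a multiple of m.
    """
    permuted = [abs(row[i]) for i in order]
    total = 0
    for g in range(0, len(permuted), m):
        # Keep the n largest magnitudes in each group of m consecutive weights.
        total += sum(sorted(permuted[g:g + m], reverse=True)[:n])
    return total


row = [9, 8, 7, 6, 1, 1, 1, 1]
identity = list(range(8))               # large weights crowded into one group
interleaved = [0, 1, 4, 5, 2, 3, 6, 7]  # large weights spread over both groups

print(retained_magnitude(row, identity))     # 19
print(retained_magnitude(row, interleaved))  # 30
```

With the original order, the four large weights compete for two slots in the same group. Spreading them across groups preserves more total magnitude at the same 2:4 sparsity, which is the effect a permutation strategy tries to exploit.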
Findings
Gyro-permutation significantly improves the accuracy of networks pruned with hierarchical N:M (HiNM) sparsity.
The GPU kernel enables efficient layer permutation during inference.
Hierarchical N:M sparsity achieves accuracy comparable to unstructured sparsity.
Abstract
N:M sparsity pruning is a powerful technique for compressing deep neural networks, utilizing NVIDIA's Sparse Tensor Core technology. This method benefits from hardware support for sparse indexing, enabling the adoption of fine-grained sparsity to maintain model accuracy while minimizing the overhead typically associated with irregular data access. Although restricted to a fixed level of sparsity due to its reliance on hardware, N:M sparsity can be combined with coarser sparsity techniques to achieve diverse compression ratios. Initially, column-wise vector sparsity is applied to a dense model, followed by row-wise N:M sparsity on the preserved column vectors. We call this multi-level approach hierarchical N:M (HiNM) sparsity. Similar to earlier single-level sparsity techniques, HiNM sparsity necessitates an effective channel permutation strategy to maximize the accuracy of the…
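The two-level pruning described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes the weight matrix is split into row blocks of height `vec_len`, that each column vector inside a block is scored by its L1 norm, and that N:M selection is by magnitude. The function name `hinm_mask` and its parameters are hypothetical.

```python
def hinm_mask(weights, vec_len, keep_cols, n, m):
    """Return a 0/1 mask for a two-level (vector, then N:M) sparsity pattern.

    Assumptions (not from the paper): L1-norm scoring of column vectors,
    magnitude-based N:M selection, rows divisible by vec_len, keep_cols
    divisible by m.
    """
    rows, cols = len(weights), len(weights[0])
    assert rows % vec_len == 0 and keep_cols % m == 0
    mask = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, vec_len):
        block = range(r0, r0 + vec_len)
        # Level 1: keep the keep_cols column vectors with the largest L1 norm.
        score = [sum(abs(weights[r][c]) for r in block) for c in range(cols)]
        ranked = sorted(range(cols), key=lambda c: -score[c])
        kept = sorted(ranked[:keep_cols])
        # Level 2: within each row, apply N:M over the preserved columns.
        for r in block:
            for g in range(0, keep_cols, m):
                group = kept[g:g + m]
                top = sorted(group, key=lambda c: -abs(weights[r][c]))[:n]
                for c in top:
                    mask[r][c] = 1
    return mask


W = [[1, -8, 3, 0, 5, -2, 7, 4],
     [2, 6, -1, 0, -9, 1, 3, 5]]
mask = hinm_mask(W, vec_len=2, keep_cols=4, n=2, m=4)
for row in mask:
    print(row)
```

In this toy case the vector level keeps 4 of 8 columns and the 2:4 level keeps half of those, so each row retains 2 of 8 weights (75% sparsity), a ratio that 2:4 sparsity alone cannot reach.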
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Graph Theory and Algorithms · Algorithms and Data Compression
Methods: Pruning
