MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity
Vladimír Macko, Vladimír Boža

TL;DR
MACKO-SpMV is a GPU-efficient sparse matrix-vector multiplication method that reduces memory use and speeds up inference for LLMs with low, unstructured sparsity, without requiring specialized hardware.
Contribution
The paper introduces MACKO-SpMV, a novel GPU-optimized format and kernel that improves efficiency for unstructured sparsity in LLM inference without specialized hardware.
Findings
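1.5x memory reduction over the dense representation at 50% sparsity
1.2-1.5x speedup over the dense representation at 50% sparsity
Speedups over existing SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP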
Abstract
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning has provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at 50% sparsity, MACKO is the first approach with a significant 1.5x memory reduction and 1.2-1.5x speedup over the dense representation. Speedups over other SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP.…
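To make the storage-overhead problem concrete, here is a minimal sketch of SpMV over the standard CSR (compressed sparse row) format, together with a byte count against a dense matrix. CSR is a generic baseline, not the MACKO format, which this summary does not describe. The 16-bit values, 32-bit indices, the 4096 x 4096 shape, and all function names are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def dense_to_csr(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row's nonzeros
    return values, col_idx, row_ptr


def csr_spmv(values: Sequence[float], col_idx: Sequence[int],
             row_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y: List[float] = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def dense_bytes(n_rows: int, n_cols: int, value_bytes: int = 2) -> int:
    """Bytes for a dense matrix; 2-byte (16-bit) values are an assumption."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows: int, nnz: int, value_bytes: int = 2, index_bytes: int = 4) -> int:
    """Bytes for CSR: one value and one column index per nonzero, plus row_ptr."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


# Small worked example.
A = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 3.0, 0.0, 0.0],
     [4.0, 0.0, 0.0, 5.0]]
x = [1.0, 2.0, 3.0, 4.0]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_spmv(values, col_idx, row_ptr, x)

# Storage at 50% sparsity for a hypothetical 4096 x 4096 weight matrix.
n = 4096
nnz = n * n // 2
ratio = csr_bytes(n, nnz) / dense_bytes(n, n)
print(y, round(ratio, 3))
```

Under these assumptions each nonzero costs 6 bytes in CSR against 2 bytes per element in the dense layout, so at 50% sparsity the CSR matrix comes out roughly 1.5x larger than the dense one rather than smaller. This index overhead is the kind of storage cost that a format aimed at low sparsity has to reduce.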
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy
