Fast Inference with Kronecker-Sparse Matrices
Antoine Gonon, Léon Zheng, Pascal Carrivain, Quoc-Tung Le

TL;DR
This paper introduces a fused GPU kernel for Kronecker-sparse matrix multiplication that cuts global memory traffic threefold and accelerates inference in models like ViT and GPT-2, with a median FP32 speedup of x1.4 and 15% lower energy consumption.
Contribution
A fused, output-stationary GPU kernel for Kronecker-sparse matrix multiplication that eliminates memory-bound tensor permutations, improving inference speed and energy efficiency.
Findings
Median speedup of x1.4 in FP32 across 600 KS patterns
Up to 22% end-to-end latency reduction in ViT-S/16 (FP32)
15% reduction in energy consumption
Abstract
Kronecker-sparse (KS) matrices -- whose supports are Kronecker products of identity and all-ones blocks -- underpin the structure of Butterfly and Monarch matrices and offer the promise of more efficient models. However, existing GPU kernels for KS matrix multiplication suffer from high data movement costs, with up to 50% of time spent on memory-bound tensor permutations. We propose a fused, output-stationary GPU kernel that eliminates these overheads, reducing global memory traffic threefold. Across 600 KS patterns, our kernel achieves in FP32 a median speedup of x1.4 and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16…
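The support structure described in the abstract can be illustrated with a small sketch. It assumes a KS pattern is parameterized by four integers (a, b, c, d) and that the support is the Kronecker product of an a x a identity, a b x c all-ones block, and a d x d identity; this parameterization and the helper names (`kron`, `identity`, `ones`, `ks_support`) are illustrative assumptions, not taken from the released code.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in A
        for row_b in B
    ]


def identity(n):
    """n x n identity matrix with 0/1 entries."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def ones(rows, cols):
    """rows x cols all-ones matrix."""
    return [[1] * cols for _ in range(rows)]


def ks_support(a, b, c, d):
    """0/1 support of an assumed KS pattern (a, b, c, d):
    identity(a) (x) ones(b, c) (x) identity(d).
    The result has a*b*d rows and a*c*d columns."""
    return kron(kron(identity(a), ones(b, c)), identity(d))


# Hypothetical example pattern: an 8 x 8 support with only 16 nonzeros,
# compared with 64 entries for a dense matrix of the same size.
S = ks_support(2, 2, 2, 2)
num_rows, num_cols = len(S), len(S[0])
num_nonzeros = sum(map(sum, S))
```

Under this assumed parameterization every row holds c nonzeros and every column holds b, so the support has a*b*c*d nonzeros in total, which is where the potential savings over a dense matrix come from.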
