Fast Inference with Kronecker-Sparse Matrices

Antoine Gonon; Léon Zheng; Pascal Carrivain; Quoc-Tung Le

arXiv:2405.15013 · cs.LG · June 16, 2025

TL;DR

This paper introduces a fused, output-stationary GPU kernel for Kronecker-sparse matrix multiplication that cuts global memory traffic threefold and accelerates inference in models like ViT and GPT-2, with a median FP32 speedup of 1.4× and 15% lower energy consumption.

Contribution

A novel fused GPU kernel for Kronecker-sparse matrices that minimizes data movement and improves inference speed and energy efficiency.

Findings

01. Median FP32 speedup of 1.4× across 600 KS patterns

02. Up to 22% end-to-end latency reduction in ViT-S/16 (FP32)

03. 15% reduction in energy consumption

Abstract

Kronecker-sparse (KS) matrices, whose supports are Kronecker products of identity and all-ones blocks, underpin the structure of Butterfly and Monarch matrices and offer the promise of more efficient models. However, existing GPU kernels for KS matrix multiplication suffer from high data movement costs, with up to 50% of time spent on memory-bound tensor permutations. We propose a fused, output-stationary GPU kernel that eliminates these overheads, reducing global memory traffic threefold. Across 600 KS patterns, our kernel achieves in FP32 a median speedup of 1.4× and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16…
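
To make the notion of a KS support concrete, here is a minimal NumPy sketch. It assumes a four-parameter pattern (a, b, c, d) whose support is I_a ⊗ 1_{b×c} ⊗ I_d; the abstract only says the support is a Kronecker product of identity and all-ones blocks, so this exact parameterization is an assumption. The function name ks_support is hypothetical and is not part of the ksmm repository's API.

```python
import numpy as np

def ks_support(a, b, c, d):
    """Boolean mask of an assumed KS pattern: I_a kron ones(b, c) kron I_d.

    The (a, b, c, d) parameterization is an assumption made for
    illustration; it is not taken from the abstract.
    """
    inner = np.kron(np.ones((b, c)), np.eye(d))   # shape (b*d, c*d)
    full = np.kron(np.eye(a), inner)              # shape (a*b*d, a*c*d)
    return full.astype(bool)

# Illustrative pattern; the values are arbitrary.
S = ks_support(2, 3, 4, 5)
print(S.shape)        # (30, 40): (a*b*d) rows, (a*c*d) columns
print(int(S.sum()))   # 120 nonzeros: a*b*c*d
print(S.mean())       # fraction of entries allowed to be nonzero
```

Under this assumption, every row of the mask holds c nonzeros and every column holds b, so a dense matrix of the same shape stores a*d times more entries than the KS factor. This is the sparsity a dedicated kernel can exploit.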

Code & Models

Repositories

pascalcarrivain/ksmm (PyTorch, official)
