SVD Contextual Sparsity Predictors for Fast LLM Inference

Georgii Serbin; Kirill Koshkin; Zhongao Sun; Anastasiya Bistrigova; C.C. Korikov

arXiv:2603.14110·cs.LG·March 17, 2026

Open Access

TL;DR

This paper introduces a fast, training-free, SVD-based method for predicting sparse patterns in the feed-forward networks (FFNs) of LLMs. It reports up to a 1.8x reduction in decoding time with less than 1% accuracy degradation, which eases edge deployment.

Contribution

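The paper presents a training-free framework for ReGLU-based FFNs in LLMs. It builds sparse pattern predictors from a truncation-aware SVD of the gate projection matrix, pairs them with a threshold calibration algorithm, and provides inference executors that support conditional computation on CUDA and CANN devices.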
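The idea of predicting sparse patterns from a low-rank factorization of the gate projection can be illustrated with a minimal NumPy sketch. It uses a plain truncated SVD as a stand-in, since the paper's truncation-aware variant is not specified on this page. The names `build_svd_predictor`, `predict_active`, `rank`, and `threshold` are hypothetical and chosen only for illustration.

```python
import numpy as np


def build_svd_predictor(w_gate, rank):
    """Factor the gate projection (d_ff x d_model) into two thin matrices.

    Assumption: plain truncated SVD, not the paper's truncation-aware variant.
    """
    u, s, vt = np.linalg.svd(w_gate, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (d_ff, rank), singular values folded in
    b = vt[:rank, :]            # shape (rank, d_model)
    return a, b


def predict_active(a, b, x, threshold=0.0):
    """Approximate the gate pre-activations and flag neurons expected to pass ReLU."""
    approx = a @ (b @ x)        # two thin products instead of one dense product
    return approx > threshold


# Usage with synthetic data (hypothetical sizes).
rng = np.random.default_rng(0)
w_gate = rng.standard_normal((64, 16))
x = rng.standard_normal(16)
a, b = build_svd_predictor(w_gate, rank=4)
mask = predict_active(a, b, x)
```

Under these assumptions the predictor costs roughly `rank * (d_ff + d_model)` multiply-adds per token, compared with `d_ff * d_model` for the dense gate projection, so it only pays off when `rank` is much smaller than both dimensions.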

Findings

01. Up to a 1.8x reduction in end-to-end decoding time.

02. Less than 1% accuracy degradation.

03. Effectiveness demonstrated on three sparse LLMs with an average FFN activation sparsity of 90%.

Abstract

Contextual sparsity is one of the approaches used to reduce computational complexity in the inference process of large language models (LLMs). Existing techniques for efficient LLM inference acceleration based on contextual sparsity with minimal accuracy degradation require training sparse pattern predictors. This paper presents a framework for accelerating inference of ReGLU-based feed-forward networks (FFNs) within LLMs. The proposed framework provides a fast, training-free method for building sparse pattern predictors using truncation-aware singular value decomposition (SVD) of the gate projection matrix, along with a threshold calibration algorithm, and inference executors supporting conditional computation on CUDA and CANN devices. Experiments on three sparse LLMs with an average activation sparsity level of 90% in the FFNs demonstrate up to a 1.8x reduction in end-to-end decoding…
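As a rough illustration of the two remaining ingredients named in the abstract, the sketch below shows one plausible threshold calibration rule and a conditional ReGLU computation in NumPy. The calibration rule (pick the largest threshold that still keeps a target recall of truly active neurons on calibration data) is an assumption, not the paper's algorithm, and the names `calibrate_threshold`, `reglu_ffn_dense`, `reglu_ffn_sparse`, and `target_recall` are hypothetical.

```python
import numpy as np


def calibrate_threshold(exact, approx, target_recall=0.99):
    """Pick a threshold on predicted pre-activations from calibration data.

    exact, approx: arrays of the same shape holding true and predicted gate
    pre-activations. Keeping entries with approx >= threshold retains at least
    target_recall of the neurons that are truly active (exact > 0).
    Assumption: this recall-based rule is illustrative only.
    """
    scores = np.sort(approx[exact > 0])
    if scores.size == 0:
        raise ValueError("no active neurons in calibration data")
    idx = int(np.floor((1.0 - target_recall) * (scores.size - 1)))
    return scores[idx]


def reglu_ffn_dense(x, w_gate, w_up, w_down):
    """Reference ReGLU FFN: W_down (ReLU(W_gate x) * (W_up x))."""
    return w_down @ (np.maximum(w_gate @ x, 0.0) * (w_up @ x))


def reglu_ffn_sparse(x, w_gate, w_up, w_down, mask):
    """Conditional computation: evaluate only the neurons selected by mask."""
    idx = np.flatnonzero(mask)
    hidden = np.maximum(w_gate[idx] @ x, 0.0) * (w_up[idx] @ x)
    return w_down[:, idx] @ hidden
```

Because ReLU zeroes every neuron with a non-positive gate pre-activation, the sparse path matches the dense one whenever the mask covers all truly active neurons. Extra neurons in the mask cost time but not accuracy, while missed active neurons are the source of any accuracy loss, which is why the threshold trades recall against speed.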

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Speech Recognition and Synthesis