SVD Contextual Sparsity Predictors for Fast LLM Inference
Georgii Serbin, Kirill Koshkin, Zhongao Sun, Anastasiya Bistrigova, C.C. Korikov

TL;DR
This paper introduces a fast, training-free, SVD-based method for predicting sparse patterns in the feed-forward networks (FFNs) of LLMs. It reduces decoding time by up to 1.8x with less than 1% accuracy degradation, which facilitates edge deployment.
Contribution
The paper presents a novel, training-free framework that builds sparse pattern predictors for LLM FFNs from a truncation-aware SVD of the gate projection matrix, combined with a threshold calibration algorithm.
Findings
Up to a 1.8x reduction in end-to-end decoding time.
Less than 1% accuracy degradation.
Evaluated on three sparse LLMs with an average FFN activation sparsity of 90%.
Abstract
Contextual sparsity is one of the approaches used to reduce computational complexity in the inference process of large language models (LLMs). Existing techniques for efficient LLM inference acceleration based on contextual sparsity with minimal accuracy degradation require training sparse pattern predictors. This paper presents a framework for accelerating inference of ReGLU-based feed-forward networks (FFNs) within LLMs. The proposed framework provides a fast, training-free method for building sparse pattern predictors using truncation-aware singular value decomposition (SVD) of the gate projection matrix, along with a threshold calibration algorithm, and inference executors supporting conditional computation on CUDA and CANN devices. Experiments on three sparse LLMs with an average activation sparsity level of 90% in the FFNs demonstrate up to a 1.8x reduction in end-to-end decoding…
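The pipeline the abstract describes (a low-rank predictor for the gate projection, a calibrated threshold, and conditional computation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a plain truncated SVD rather than the truncation-aware variant, and the recall-based calibration rule, all function names, and the random weights are assumptions made for illustration only.

```python
import numpy as np

def build_predictor(w_gate, rank):
    """Factor w_gate (d_ff x d_model) into two thin matrices with a plain truncated SVD.

    Assumption: the paper uses a truncation-aware SVD; this sketch does not.
    """
    u, s, vt = np.linalg.svd(w_gate, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (d_ff, rank), singular values folded in
    b = vt[:rank, :]             # (rank, d_model)
    return a, b

def calibrate_threshold(a, b, w_gate, calib_x, target_recall=0.99):
    """Hypothetical calibration rule: choose the largest threshold that still
    keeps at least target_recall of the truly active neurons on calib_x."""
    true_pre = calib_x @ w_gate.T        # exact gate pre-activations
    pred_pre = (calib_x @ b.T) @ a.T     # low-rank predicted pre-activations
    scores = np.sort(pred_pre[true_pre > 0])
    k = int(np.floor((1.0 - target_recall) * scores.size))
    return float(scores[k])

def dense_reglu_ffn(x, w_gate, w_up, w_down):
    """Reference ReGLU FFN: w_down @ (relu(w_gate @ x) * (w_up @ x))."""
    return w_down @ (np.maximum(w_gate @ x, 0.0) * (w_up @ x))

def sparse_reglu_ffn(x, w_gate, w_up, w_down, a, b, threshold):
    """Compute only the neurons the predictor marks as active."""
    scores = a @ (b @ x)                           # cheap low-rank prediction
    idx = np.flatnonzero(scores >= threshold)      # predicted-active neurons
    gate = np.maximum(w_gate[idx] @ x, 0.0)        # rows of the gate projection
    up = w_up[idx] @ x                             # matching rows of the up projection
    return w_down[:, idx] @ (gate * up)            # matching columns of the down projection

# Toy usage with random weights (not a real model).
rng = np.random.default_rng(0)
d_model, d_ff, rank = 32, 128, 8
w_gate = rng.standard_normal((d_ff, d_model))
w_up = rng.standard_normal((d_ff, d_model))
w_down = rng.standard_normal((d_model, d_ff))

a, b = build_predictor(w_gate, rank)
threshold = calibrate_threshold(a, b, w_gate, rng.standard_normal((256, d_model)))

x = rng.standard_normal(d_model)
y_dense = dense_reglu_ffn(x, w_gate, w_up, w_down)
y_sparse = sparse_reglu_ffn(x, w_gate, w_up, w_down, a, b, threshold)
print("relative error:", np.linalg.norm(y_dense - y_sparse) / np.linalg.norm(y_dense))
```

Random Gaussian weights activate roughly half of the neurons, so this toy setup does not reproduce the 90% sparsity regime or the speedups reported in the paper; it only shows where the predictor and the threshold enter the computation.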
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Speech Recognition and Synthesis
