Effective Interplay between Sparsity and Quantization: From Theory to Practice

Simla Burcu Harma; Ayan Chakraborty; Elizaveta Kostenok; Danila Mishin; Dongho Ha; Babak Falsafi; Martin Jaggi; Ming Liu; Yunho Oh; Suvinay Subramanian; Amir Yazdanbakhsh

arXiv:2405.20935 · cs.LG · January 29, 2025 · 2 cites

TL;DR

This paper demonstrates that sparsity and quantization in neural network compression are interconnected, and their combined effects can significantly impact model accuracy, emphasizing the importance of application order and error management.

Contribution

The paper provides the first mathematical proof of the non-orthogonality of sparsity and quantization, and offers practical insights into their combined effects on large models.

Findings

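01. The order in which sparsity and quantization are applied affects model accuracy.

02. The combined error from sparsity and quantization can be significant, exceeding what each method introduces on its own.

03. Applying quantization before sparsity may disrupt important tensor elements, because quantization can change the relative magnitudes that the sparsity step uses to decide what to keep.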
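
The ordering effect described in the findings can be reproduced on a toy vector. The sketch below is an illustration only, not the paper's experimental setup: it assumes magnitude-based pruning (keep the largest entries) and a symmetric uniform quantizer scaled by the tensor's maximum magnitude. The function names `sparsify` and `quantize`, the 3-bit width, and the example values are all hypothetical choices made for this example.

```python
import numpy as np


def sparsify(x, keep):
    """Magnitude pruning: keep the `keep` largest-magnitude entries, zero the rest."""
    # A stable sort breaks ties by position, so the result is deterministic.
    order = np.argsort(-np.abs(x), kind="stable")
    out = np.zeros_like(x)
    out[order[:keep]] = x[order[:keep]]
    return out


def quantize(x, bits):
    """Symmetric uniform quantization, scaled by the tensor's max magnitude."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    scale = max_abs / qmax
    return np.round(x / scale) * scale


# Toy tensor: 0.52 is larger than 0.46, so pruning the original values keeps 0.52.
x = np.array([0.9, 0.46, 0.52, -0.1])

s_then_q = quantize(sparsify(x, keep=2), bits=3)  # sparsity first
q_then_s = sparsify(quantize(x, bits=3), keep=2)  # quantization first

err_sq = np.linalg.norm(x - s_then_q)
err_qs = np.linalg.norm(x - q_then_s)

print("sparsity -> quantization:", s_then_q, "error:", err_sq)
print("quantization -> sparsity:", q_then_s, "error:", err_qs)
```

In this example, quantizing first rounds 0.46 and 0.52 to the same level, so the pruning step can no longer tell which of the two was originally larger and drops the more important one. Pruning first selects on the original magnitudes and ends up with the smaller overall error.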

Abstract

The increasing size of deep neural networks (DNNs) necessitates effective model compression to reduce their computational and memory footprints. Sparsity and quantization are two prominent compression methods that have been shown to reduce DNNs' computational and memory footprints significantly while preserving model accuracy. However, how these two methods interact when combined together remains a key question for developers, as many tacitly assume that they are orthogonal, meaning that their combined use does not introduce additional errors beyond those introduced by each method independently. In this paper, we provide the first mathematical proof that sparsity and quantization are non-orthogonal. We corroborate these results with experiments spanning a range of large language models, including the OPT and LLaMA model families (with 125M to 8B parameters), and vision models like ViT…

Videos

Effective Interplay between Sparsity and Quantization: From Theory to Practice · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Advanced MEMS and NEMS Technologies

Methods: Average Pooling · Global Average Pooling · Kaiming Initialization · Max Pooling · Convolution · OPT · LLaMA