ROSAQ: Rotation-based Saliency-Aware Weight Quantization for Efficiently Compressing Large Language Models
Junho Yoon, Geom Lee, Donghyeon Jeon, Inho Kang, Seung-Hoon Na

TL;DR
ROSAQ introduces a rotation-based, saliency-aware weight quantization method for large language models, leveraging PCA and mixed-precision to improve compression efficiency and inference speed.
Contribution
It proposes a novel PCA-based rotation and saliency detection approach for weight quantization, enhancing model compression and inference speed.
Findings
Outperforms baseline saliency-aware quantization methods.
Achieves 2.3x speed-up in token generation.
Improves model compression with minimal accuracy loss.
Abstract
Quantization has been widely studied as an effective technique for reducing the memory requirement of large language models (LLMs), potentially improving the latency time as well. Utilizing the characteristic of rotational invariance of transformer, we propose the rotation-based saliency-aware weight quantization (ROSAQ), which identifies salient channels in the projection feature space, not in the original feature space, where the projected "principal" dimensions are naturally considered as "salient" features. The proposed ROSAQ consists of 1) PCA-based projection, which first performs principal component analysis (PCA) on a calibration set and transforms via the PCA projection, 2) Salient channel dentification, which selects dimensions corresponding to the K-largest eigenvalues as salient channels, and 3) Saliency-aware quantization with mixed-precision, which uses FP16 for salient…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Principal Components Analysis · Sparse Evolutionary Training
