ROSAQ: Rotation-based Saliency-Aware Weight Quantization for Efficiently Compressing Large Language Models

Junho Yoon; Geom Lee; Donghyeon Jeon; Inho Kang; Seung-Hoon Na

arXiv:2506.13472 · cs.CL · June 18, 2025

TL;DR

ROSAQ introduces a rotation-based, saliency-aware weight quantization method for large language models, leveraging PCA and mixed-precision to improve compression efficiency and inference speed.

Contribution

The paper proposes a PCA-based rotation and saliency-detection approach for weight quantization, aimed at better model compression and faster inference.

Findings

01

Outperforms baseline saliency-aware quantization methods.

02

Achieves about a 2.3x speed-up in token generation over an FP16 implementation.

03

Improves model compression with minimal accuracy loss.

Abstract

Quantization has been widely studied as an effective technique for reducing the memory requirements of large language models (LLMs), potentially improving latency as well. Exploiting the rotational invariance of transformers, we propose rotation-based saliency-aware weight quantization (ROSAQ), which identifies salient channels in the projected feature space rather than in the original feature space, where the projected "principal" dimensions are naturally considered "salient" features. The proposed ROSAQ consists of 1) PCA-based projection, which first performs principal component analysis (PCA) on a calibration set and transforms via the PCA projection, 2) salient channel identification, which selects the dimensions corresponding to the K largest eigenvalues as salient channels, and 3) saliency-aware quantization with mixed precision, which uses FP16 for salient…
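The three steps listed in the abstract can be illustrated with a minimal NumPy sketch for a single linear layer y = XW. This is not the authors' implementation, and it makes several assumptions: the function names (`pca_rotation`, `quantize_rtn`, `rosaq_like_quantize`) are hypothetical, the low-precision path is a plain symmetric round-to-nearest 4-bit quantizer chosen only for illustration (the truncated abstract does not specify it here), and calibration activations are given as a matrix `X` with one row per token.

```python
import numpy as np

def pca_rotation(X):
    """Step 1 (sketch): PCA on calibration activations X (n_tokens, d_in).

    Returns an orthogonal matrix R whose columns are eigenvectors of the
    second-moment matrix X^T X / n, sorted by descending eigenvalue.
    """
    second_moment = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(second_moment)  # ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order], eigvals[order]

def quantize_rtn(W, bits=4):
    """Assumed low-precision path: symmetric round-to-nearest
    quantize-dequantize with one scale per output column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def rosaq_like_quantize(W, X, k, bits=4):
    """Steps 2 and 3 (sketch): keep the k leading principal dimensions
    in FP16 and quantize the remaining rows of the rotated weight.

    W has shape (d_in, d_out). Because R is orthogonal,
    (X @ R) @ (R.T @ W) equals X @ W before any quantization.
    """
    R, _ = pca_rotation(X)
    W_rot = R.T @ W  # rows now index principal dimensions
    W_q = np.empty_like(W_rot)
    # Salient channels: the rows for the k largest eigenvalues, stored as FP16.
    W_q[:k] = W_rot[:k].astype(np.float16).astype(W_rot.dtype)
    # Non-salient channels: low-bit quantization (assumed RTN here).
    W_q[k:] = quantize_rtn(W_rot[k:], bits)
    return R, W_q
```

At inference time the layer output would be approximated as `(X @ R) @ W_q`. In this sketch the rotation is applied to the activations explicitly; how the rotation is folded into neighbouring layers is outside what the abstract excerpt states.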

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Principal Components Analysis · Sparse Evolutionary Training