SpinQuant: LLM quantization with learned rotations
Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

TL;DR
SpinQuant introduces learned rotation matrices for LLM quantization, significantly improving accuracy over prior methods by reducing quantization errors and outliers, especially in challenging models like LLaMA-3 8B.
Contribution
The paper proposes a novel learned rotation approach for LLM quantization, outperforming existing methods by optimizing rotation parameters for better accuracy.
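To make "optimizing rotation parameters" concrete, the sketch below searches for a rotation that lowers the output error of a single quantized linear layer. Everything in it is an illustrative assumption rather than the paper's procedure: the toy data, the per-row 4-bit `fake_quant`, the Cayley parameterization in `cayley_rotation`, and above all the optimizer, which is a plain accept-if-better random search standing in for whatever training method the authors use (this summary does not describe it).

```python
import numpy as np

def fake_quant(x, bits=4):
    """Symmetric per-row round-to-nearest quantize-dequantize (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.round(x / scale) * scale

def cayley_rotation(params, n):
    """Map n*(n-1)/2 free parameters to an orthogonal matrix (Cayley transform)."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = params
    A = A - A.T                           # skew-symmetric matrix
    I = np.eye(n)
    return np.linalg.solve(I + A, I - A)  # (I + A)^-1 (I - A) is orthogonal

def quant_loss(R, X, W):
    """Relative output error when the rotated activations are quantized."""
    Y_ref = X @ W
    Y_q = fake_quant(X @ R) @ (R.T @ W)   # R.T @ W can be merged offline
    return np.linalg.norm(Y_q - Y_ref) / np.linalg.norm(Y_ref)

rng = np.random.default_rng(0)
n = 16
X = rng.standard_normal((256, n))
X[:, 0] *= 20.0                           # hypothetical outlier channel
W = rng.standard_normal((n, n)) / np.sqrt(n)

params = np.zeros(n * (n - 1) // 2)       # all-zero parameters give R = identity
initial_loss = quant_loss(cayley_rotation(params, n), X, W)
best_loss = initial_loss

# Gradient-free stand-in for rotation learning: keep a perturbation only if it helps.
for _ in range(300):
    candidate = params + 0.05 * rng.standard_normal(params.shape)
    loss = quant_loss(cayley_rotation(candidate, n), X, W)
    if loss < best_loss:
        params, best_loss = candidate, loss

R_learned = cayley_rotation(params, n)
print(f"relative error without rotation: {initial_loss:.4f}")
print(f"relative error after search:     {best_loss:.4f}")
```

Because every candidate is built through the Cayley transform, the search never leaves the set of orthogonal matrices, so the full-precision output of the layer is unchanged throughout; only the quantized error moves.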
Findings
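With 4-bit quantization of weights, activations, and the KV cache, SpinQuant narrows the zero-shot reasoning accuracy gap to full precision to 2.9 points on LLaMA-2 7B.
In the same setting, it surpasses LLM-QAT by 19.1 points and SmoothQuant by 25.0 points.
On the hard-to-quantize LLaMA-3 8B, it reduces the gap to full precision by up to 45.1% relative to QuaRot.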
Abstract
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation,…
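The two claims in the abstract, that a rotation leaves the full-precision output unchanged and that it helps quantization when outliers are present, can be checked on a toy linear layer. The sketch below is a minimal illustration under stated assumptions, not the paper's pipeline: the data is synthetic with one artificially scaled outlier channel, `fake_quant` is a simple per-row 4-bit round-to-nearest quantizer chosen for illustration, and `random_rotation` draws an orthogonal matrix via QR decomposition.

```python
import numpy as np

def random_rotation(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))        # flip column signs; q stays orthogonal

def fake_quant(x, bits=4):
    """Symmetric per-row round-to-nearest quantize-dequantize (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
n = 64
X = rng.standard_normal((128, n))         # toy activations, one row per token
X[:, 3] *= 20.0                           # hypothetical outlier channel
W = rng.standard_normal((n, n)) / np.sqrt(n)

# 1) In full precision, inserting R and R^T leaves the output unchanged.
R = random_rotation(n, rng)
Y_ref = X @ W
Y_rot = (X @ R) @ (R.T @ W)
assert np.allclose(Y_ref, Y_rot)

# 2) With quantized activations, the rotation changes the error.
def output_error(R):
    """Relative output error when the rotated activations are quantized."""
    Y_q = fake_quant(X @ R) @ (R.T @ W)
    return np.linalg.norm(Y_q - Y_ref) / np.linalg.norm(Y_ref)

err_plain = output_error(np.eye(n))
errs_rot = [output_error(random_rotation(n, rng)) for _ in range(20)]

print(f"no rotation:      {err_plain:.4f}")
print(f"random rotations: min {min(errs_rot):.4f}, max {max(errs_rot):.4f}")
```

The spread between the best and worst of the 20 random rotations is the toy-scale analogue of the abstract's observation that some random rotations quantize much better than others, which is the motivation for choosing the rotation deliberately instead of sampling it.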
Peer Reviews
Decision · ICLR 2025 Poster
The idea of learned rotation in the context of LLM activation outlier reduction seems interesting. The authors have done a thorough analysis to justify the need for learned rotation. $\texttt{SpinQuant}$ can outperform the GPTQ baseline by a significant margin!
1. The performance of $\texttt{SpinQuant}$ is similar to or worse than QuIP in most cases. This raises further concerns about the learned rotation method, as QuIP does not rely on any activation rotation method.
2. A thorough discussion of calibration is missing. As the method needs to learn R1, R2, R3, and R4, the calibration and/or fine-tuning overhead is a concern.
3. How is the learning of activation rotations affected when calibration data is extremely scarce?
4. The can be other
+ The paper pursues the rotation method for post-training LLM quantization, which demonstrates convincing performance.
+ The authors conducted thorough investigations of random rotations to propose quantization-oriented rotation learning.
+ The proposed framework develops strategies for the challenging activation and KV cache quantization, besides weight quantization.
- While the rotation matrix method is effective in mitigating outliers in LLM quantization, more rigorous explanations or proofs of the underlying rationale would strengthen the paper beyond its empirical results.
- Introducing rotation matrices into the inference pipeline may add computational overhead. Could the authors provide an analysis of this overhead versus the benefit from quantization?
- By employing learned rotations, SpinQuant offers a new approach to minimizing outliers.
- The SpinQuant-easy and SpinQuant-hard modes offer flexibility, making the approach adaptable to different computational constraints and accuracy requirements through mergeable and non-mergeable weights.
- SpinQuant shows compatibility with methods like GPTQ, allowing it to integrate into established quantization pipelines.
- The benchmarks and model selections, such as the LLaMA series, seem to focus on favorable cases for SpinQuant. Testing on a broader range of architectures (e.g., Gemma2) would provide a more thorough evaluation. The Gemma2 series in particular, which exhibits distinct activation characteristics, could provide a more comprehensive assessment of SpinQuant's generalizability and robustness across diverse LLM types.
- Although the authors note the optimization time (up to 3.5 hours for larger models), it re
Code & Models
- 🤗 meta-llama/Llama-3.2-3B-Instruct · model · 7.4M dl · ♡ 2078
- 🤗 meta-llama/Llama-3.2-3B · model · 1.2M dl · ♡ 721
- 🤗 meta-llama/Llama-3.2-1B · model · 1.8M dl · ♡ 2349
- 🤗 meta-llama/Llama-3.2-1B-Instruct · model · 4.1M dl · ♡ 1343
- 🤗 meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8 · model · 179 dl · ♡ 47
- 🤗 meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 · model · 95 dl · ♡ 38
- 🤗 meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8 · model · 55 dl · ♡ 71
- 🤗 meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8 · model · 50 dl · ♡ 39
- 🤗 venkycs/llama-3.2-3b-instruct-abliterated · model · 4 dl
- 🤗 curiousily/Llama-3.2-1B-Mental-Health-Sentiment · model · 175 dl
Taxonomy
Topics: Advanced Data Compression Techniques · Algorithms and Data Compression
Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
