ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

Suyoung Kim; Sunghyun Wee; Hyeonjin Kim; Kyomin Hwang; Hyunho Lee; Nojun Kwak

arXiv:2604.11080 · cs.CV · April 14, 2026

TL;DR

ReSpinQuant introduces an efficient layer-wise quantization method for LLMs that combines high accuracy with minimal inference overhead by using offline activation rotation fusion and residual subspace rotation.

Contribution

The paper proposes a quantization framework that achieves layer-wise adaptation with negligible inference overhead, outperforming global rotation methods and matching the accuracy of more expensive layer-wise approaches.

Findings

01. ReSpinQuant outperforms global rotation methods in accuracy.

02. ReSpinQuant matches the accuracy of layer-wise methods.

03. ReSpinQuant incurs only negligible inference overhead.

Abstract

Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles…
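
The abstract's distinction between fused and online rotations rests on a simple identity: for an orthogonal matrix R, (xR)(RᵀW) = xW, so the weight-side factor RᵀW can be precomputed once and stored. The NumPy sketch below illustrates only that general identity and a basic fake-quantization comparison; it is not the ReSpinQuant algorithm. The helper names (`random_rotation`, `fuse_rotation`, `quantize_per_token`), the random rotation standing in for a learned one, the toy shapes, and the injected outlier channel are all assumptions made for illustration.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR (a stand-in for a learned rotation)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def fuse_rotation(w, rot):
    """Fold the inverse rotation into the weight offline: W' = R^T W."""
    return rot.T @ w

def quantize_per_token(x, bits=4):
    """Symmetric per-row fake quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(1)
dim, out_dim, tokens = 64, 32, 8
x = rng.standard_normal((tokens, dim))
x[:, 3] *= 10.0                       # hypothetical outlier channel
w = rng.standard_normal((dim, out_dim))

R = random_rotation(dim)
x_rot = x @ R                         # rotation applied to activations
w_rot = fuse_rotation(w, R)           # computed once, offline

fp_out = x @ w                        # full-precision reference
rot_out = x_rot @ w_rot               # same result up to float error

# Output error when only the activations are quantized to 4 bits.
err_plain = np.linalg.norm(quantize_per_token(x) @ w - fp_out)
err_rot = np.linalg.norm(quantize_per_token(x_rot) @ w_rot - fp_out)
print(f"error without rotation: {err_plain:.3f}")
print(f"error with rotation:    {err_rot:.3f}")
```

In this toy setup the product `x @ R` is still computed explicitly. In a real transformer, global rotation methods avoid that step by also folding R into the weights of the preceding block, which is the fusion the abstract says layer-wise methods cannot perform. How much the rotation changes the quantization error depends on the data, the bit width, and the rotation itself, so the printed numbers should be read as an illustration rather than a benchmark.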
