IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
Zhongping Ji

TL;DR
IsoQuant introduces a quaternion-based, hardware-aligned rotation framework for efficient low-bit vector quantization in large language model key-value cache compression, achieving significant speedups with minimal accuracy loss.
Contribution
It proposes a blockwise rotation approach built on quaternion algebra and the isoclinic decomposition of SO(4), giving hardware-efficient rotations that avoid the storage and compute cost of dense orthogonal transforms.
Findings
IsoQuant-Full reduces rotation cost by over 50% compared to RotorQuant.
IsoQuant achieves 4.5x to 4.7x kernel speedups over RotorQuant.
It maintains comparable reconstruction MSE, with peak speedups above 6x.
Abstract
Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive storage and compute. RotorQuant reduces this cost with blockwise 3D Clifford rotors, yet the resulting 3D partition is poorly aligned with modern hardware and offers limited local mixing. We propose IsoQuant, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of SO(4). It represents each 4D block as a quaternion and applies a closed-form transform $T(v) = q_L v \overline{q_R}$. This yields two main variants: IsoQuant-Full, which realizes the full SO(4) rotation, and IsoQuant-Fast, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight 2D special case. At , IsoQuant-Full reduces forward rotation cost from about…
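The blockwise quaternion rotation described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's kernel: it assumes the standard isoclinic form v -> q_L v conj(q_R) with unit quaternions, a (w, x, y, z) component order, and an arbitrary illustrative dimension of 128. All function names (`qmul`, `qconj`, `rotate_blocks`, `unrotate_blocks`) are hypothetical.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as (..., 4) arrays in (w, x, y, z) order."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def qconj(q):
    """Quaternion conjugate: negate the three imaginary components."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def random_unit_quaternions(n, rng):
    """Draw n unit quaternions (assumed parameterization of each block's rotation)."""
    q = rng.standard_normal((n, 4))
    return q / np.linalg.norm(q, axis=-1, keepdims=True)

def rotate_blocks(x, q_left, q_right):
    """Apply v -> q_left * v * conj(q_right) to each 4D block of x."""
    v = x.reshape(-1, 4)
    return qmul(qmul(q_left, v), qconj(q_right)).reshape(x.shape)

def unrotate_blocks(y, q_left, q_right):
    """Inverse transform: v = conj(q_left) * y * q_right for unit quaternions."""
    v = y.reshape(-1, 4)
    return qmul(qmul(qconj(q_left), v), q_right).reshape(y.shape)

rng = np.random.default_rng(0)
d = 128                      # illustrative dimension, must be divisible by 4
x = rng.standard_normal(d)
q_left = random_unit_quaternions(d // 4, rng)
q_right = random_unit_quaternions(d // 4, rng)

y = rotate_blocks(x, q_left, q_right)          # rotated vector, ready for quantization
x_back = unrotate_blocks(y, q_left, q_right)   # recovers x up to floating-point error
```

Because each block is multiplied by unit quaternions on both sides, the transform is orthogonal and preserves the norm of every 4D block. Fixing `q_right` to the identity quaternion (1, 0, 0, 0) leaves a single-factor rotation, which is one plausible reading of the lower-cost variant the abstract calls IsoQuant-Fast.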
