ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
Edward J. Yoon

TL;DR
ITQ3_S introduces a 3-bit weight quantization method for LLMs that uses rotation-domain smoothing and a fused inverse transform to achieve high fidelity and efficiency on consumer GPUs.
Contribution
It proposes a novel interleaved ternary quantization format with a rotation strategy and a fused CUDA kernel, enabling high-accuracy 3-bit LLM inference.
Findings
Achieves perplexity comparable to FP16 on NVIDIA RTX 5090.
Delivers more than 1.5x the throughput of 4-bit methods.
Ensures reconstruction error is bounded solely by the quantization grid.
Abstract
We present ITQ3_S (Interleaved Ternary Quantization -- Specialized), a novel 3-bit weight quantization format for LLMs integrating TurboQuant (TQ), a rotation-domain strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit methods suffer precision loss from heavy-tailed weight distributions and inter-channel outliers. ITQ3_S pre-rotates the weight space via FWHT before quantization, spreading outlier energy across the vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. We derive a rigorous dequantization procedure fusing a 256-point Inverse FWHT into the CUDA shared-memory loading stage, ensuring reconstruction error is bounded exclusively by the ternary quantization grid with no additional error from the transform inversion. For any weight vector $\mathbf{w}$, the reconstruction satisfies $\|\hat{\mathbf{w}}…
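The rotate, quantize, inverse-rotate pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the ITQ3_S format or its fused CUDA kernel: the helper names (`fwht`, `ternary_quantize`, `l2`), the mean-absolute-value scale, the half-scale threshold, and the synthetic weight vector with a single outlier are all assumptions chosen for illustration. The sketch only demonstrates the general property that an orthonormal Walsh-Hadamard rotation preserves the L2 norm, so the error measured in weight space equals the error introduced by the quantization grid in the rotated domain.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (natural/Sylvester order).

    len(x) must be a power of two. With the 1/sqrt(n) scaling the transform
    is its own inverse, so the same function serves as the inverse FWHT.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def ternary_quantize(x):
    """Map each value to {-s, 0, +s}.

    Illustrative assumption: s is the mean absolute value and the dead zone
    is |v| <= s / 2. The paper's actual grid may differ.
    """
    s = sum(abs(v) for v in x) / len(x)
    t = 0.5 * s
    return [0.0 if abs(v) <= t else math.copysign(s, v) for v in x]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


random.seed(0)
# Synthetic 256-element weight block with one large outlier (hypothetical data).
w = [random.gauss(0.0, 0.02) for _ in range(256)]
w[7] = 1.0

rotated = fwht(w)              # rotation spreads the outlier across all entries
q = ternary_quantize(rotated)  # quantize in the rotated domain
w_hat = fwht(q)                # inverse rotation back to weight space

err_rot = l2(q, rotated)       # error introduced by the ternary grid
err_w = l2(w_hat, w)           # reconstruction error in weight space

print("max |w|       :", max(abs(v) for v in w))
print("max |rotated| :", max(abs(v) for v in rotated))
print("grid error    :", err_rot)
print("weight error  :", err_w)
```

The two printed errors agree up to floating-point rounding, which is the norm-preservation property at work: inverting the rotation adds no error beyond what the quantizer introduced. The maximum magnitude also drops sharply after rotation, since the single outlier's energy is divided across all 256 coefficients.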
