ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

Edward J. Yoon

arXiv:2603.27914·cs.LG·April 1, 2026

TL;DR

ITQ3_S introduces a 3-bit weight quantization method for LLMs that uses rotation-domain smoothing and a fused inverse transform to achieve high fidelity and efficiency on consumer GPUs.

Contribution

The paper proposes a novel interleaved ternary quantization format, combined with a rotation strategy and a fused CUDA kernel, that enables high-accuracy 3-bit LLM inference.

Findings

01. Achieves perplexity comparable to FP16 on NVIDIA RTX 5090.

02. Provides over 1.5x throughput compared to 4-bit methods.

03. Ensures reconstruction error is bounded solely by the quantization grid.

Abstract

We present ITQ3_S (Interleaved Ternary Quantization -- Specialized), a novel 3-bit weight quantization format for LLMs integrating TurboQuant (TQ), a rotation-domain strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit methods suffer precision loss from heavy-tailed weight distributions and inter-channel outliers. ITQ3_S pre-rotates the weight space via FWHT before quantization, spreading outlier energy across the vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. We derive a rigorous dequantization procedure fusing a 256-point Inverse FWHT into the CUDA shared-memory loading stage, ensuring reconstruction error is bounded exclusively by the ternary quantization grid with no additional error from the transform inversion. For any weight vector $\mathbf{w} \in \mathbb{R}^{256}$, the reconstruction satisfies $\|\hat{\mathbf{w}}…
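The rotate, quantize, inverse-rotate idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's fused CUDA kernel or its packed 3-bit format. The function names `fwht` and `ternary_quantize`, the mean-absolute-value scale, and the half-scale threshold are all assumptions made for illustration. The sketch relies on one known fact: the orthonormal Walsh-Hadamard transform is its own inverse and preserves the Euclidean norm, so the error measured in the rotated domain equals the error after rotating back.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def ternary_quantize(vec):
    """Hypothetical ternary grid: codes in {-1, 0, +1} times one shared scale."""
    scale = sum(abs(x) for x in vec) / len(vec)  # assumed scale: mean |x|
    codes = [0 if abs(x) < 0.5 * scale else (1 if x > 0 else -1) for x in vec]
    return codes, scale


def l2(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(256)]
w[7] = 1.5  # inject a single large outlier

rotated = fwht(w)                           # rotation spreads the outlier
codes, scale = ternary_quantize(rotated)    # quantize in the rotated domain
dequant_rotated = [c * scale for c in codes]
w_hat = fwht(dequant_rotated)               # inverse rotation (same transform)

err_rotated = l2([a - b for a, b in zip(rotated, dequant_rotated)])
err_original = l2([a - b for a, b in zip(w, w_hat)])

print("max |w|       :", max(abs(x) for x in w))
print("max |rotated| :", max(abs(x) for x in rotated))
print("error (rotated domain) :", err_rotated)
print("error (original domain):", err_original)
```

In this sketch the two printed errors agree up to floating-point rounding, which is the sense in which the inverse transform adds no error beyond the quantization grid. The largest magnitude also drops after rotation because the outlier's energy is spread across all 256 coordinates.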
