BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
Venugopalan Iyengar

TL;DR
This paper introduces BCJR-QAT, a differentiable relaxation of trellis-coded weight quantization that makes quantization-aware training of large language models possible on a trellis code and improves on QTIP-PTQ by 0.084 PPL on WikiText-2.
Contribution
The paper proposes BCJR-QAT, which replaces the non-differentiable Viterbi argmax with the BCJR forward-backward sum-product algorithm, yielding a differentiable soft codeword for quantization-aware training of LLMs.
Findings
A fused Triton kernel for BCJR achieves a 6.57x speedup with fp32 parity on a single consumer GPU.
It improves on QTIP-PTQ by 0.084 PPL (lower perplexity) on WikiText-2.
Empirical results show improved Llama-3.2-1B quantization performance.
Abstract
Trellis-coded quantization sets the current 2-bit post-training frontier for LLMs (QTIP), but pushing below the PTQ ceiling requires quantization-aware training, and QAT on a trellis is obstructed by the non-differentiable Viterbi argmax. We introduce BCJR-QAT, a relaxation that replaces the argmax with the BCJR forward-backward sum-product algorithm at a finite temperature, producing a soft codeword equal to the Boltzmann expectation over trellis paths, exactly differentiable, recovering the hard QTIP code in the zero-temperature limit, and mathematically identical to the transfer-matrix computation for a 1D Ising-like spin chain. We contribute (i) a fused Triton kernel making BCJR tractable on a single consumer GPU (6.57x speedup, fp32 parity); (ii) a quantitative drift-budget theory of when BCJR-QAT can escape the QTIP-PTQ Voronoi basin, verified across four experiments; and (iii) a positive…
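The relaxation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation or its Triton kernel. It assumes a toy 4-state shift-register trellis with one input bit per step, a hypothetical scalar codebook with one value per state, a squared-error path cost, and a free start state; the names `soft_codeword`, `viterbi_codeword`, and `successors` are made up for illustration. The soft codeword is the posterior-weighted codebook value at each position, computed by a log-domain forward-backward pass, and it is compared against the hard Viterbi path.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def successors(s, n_states):
    # Assumed toy trellis: shift one new bit into the state register.
    return [((s << 1) | b) % n_states for b in (0, 1)]

def soft_codeword(w, codebook, temp):
    """Boltzmann expectation of the codeword over all trellis paths."""
    n, T = len(codebook), len(w)
    # Log-potentials: negative squared error divided by the temperature.
    phi = [[-(w[t] - codebook[s]) ** 2 / temp for s in range(n)]
           for t in range(T)]
    preds = [[] for _ in range(n)]
    for s in range(n):
        for s2 in successors(s, n):
            preds[s2].append(s)
    # Forward pass: alpha[t][s] = log-weight of all prefixes ending in s.
    alpha = [phi[0][:]]
    for t in range(1, T):
        alpha.append([phi[t][s2] + logsumexp([alpha[t - 1][s] for s in preds[s2]])
                      for s2 in range(n)])
    # Backward pass: beta[t][s] = log-weight of all suffixes leaving s.
    beta = [[0.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [logsumexp([phi[t + 1][s2] + beta[t + 1][s2]
                              for s2 in successors(s, n)])
                   for s in range(n)]
    log_z = logsumexp(alpha[-1])
    out = []
    for t in range(T):
        post = [math.exp(alpha[t][s] + beta[t][s] - log_z) for s in range(n)]
        out.append(sum(p * c for p, c in zip(post, codebook)))
    return out

def viterbi_codeword(w, codebook):
    """Hard minimum-cost path (the argmax that the relaxation replaces)."""
    n, T = len(codebook), len(w)
    cost = [(w[0] - codebook[s]) ** 2 for s in range(n)]
    back = []
    for t in range(1, T):
        new_cost, ptr = [math.inf] * n, [0] * n
        for s in range(n):
            for s2 in successors(s, n):
                c = cost[s] + (w[t] - codebook[s2]) ** 2
                if c < new_cost[s2]:
                    new_cost[s2], ptr[s2] = c, s
        back.append(ptr)
        cost = new_cost
    s = min(range(n), key=cost.__getitem__)
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    path.reverse()
    return [codebook[s] for s in path]

codebook = [-1.5, -0.5, 0.5, 1.5]      # hypothetical per-state values
weights = [0.4, -1.2, 1.3, 0.6, -0.4]  # hypothetical weights to quantize
hard = viterbi_codeword(weights, codebook)
soft_cold = soft_codeword(weights, codebook, temp=0.01)
soft_hot = soft_codeword(weights, codebook, temp=1e6)
```

On this toy input the low-temperature soft codeword matches the Viterbi codeword to within numerical tolerance, while the high-temperature one collapses toward the codebook mean. Every step is a composition of sums, exponentials, and logarithms, so writing the same recursion in an autodiff framework would give gradients with respect to both the weights and the codebook.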
