BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

Venugopalan Iyengar

arXiv:2605.10655 · cs.LG · May 12, 2026

TL;DR

This paper introduces BCJR-QAT, a differentiable relaxation of trellis-coded weight quantization that makes quantization-aware training of large language models possible on a trellis code.

Contribution

The paper proposes BCJR-QAT, which replaces the non-differentiable Viterbi argmax with the BCJR forward-backward sum-product algorithm at temperature T, so that trellis-coded quantization of LLM weights can be trained with gradients.

Findings

01

A fused Triton kernel for BCJR achieves a 6.57x speedup with fp32 parity on a single consumer GPU.

02

It lowers WikiText-2 perplexity by 0.084 relative to QTIP-PTQ.

03

Empirical results show improved Llama-3.2-1B quantization performance.

Abstract

Trellis-coded quantization sets the current 2-bit post-training frontier for LLMs (QTIP), but pushing below the PTQ ceiling requires quantization-aware training, and QAT on a trellis is obstructed by the non-differentiable Viterbi argmax. We introduce BCJR-QAT, a relaxation that replaces the argmax with the BCJR forward-backward sum-product algorithm at temperature $T$, producing a soft codeword equal to the Boltzmann expectation over trellis paths, exactly differentiable, recovering the hard QTIP code as $T \to 0$, and mathematically identical to the transfer-matrix computation for a 1D Ising-like spin chain. We contribute (i) a fused Triton kernel making BCJR tractable on a single consumer GPU ($6.57\times$ speedup, fp32 parity); (ii) a quantitative drift-budget theory of when BCJR-QAT can escape the QTIP-PTQ Voronoi basin, verified across four experiments; and (iii) a positive…
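The relaxation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the 3-bit bitshift trellis, the random `codebook`, the squared-error path energy, and the function names `bitshift_transitions`, `soft_trellis_quantize`, and `viterbi_quantize` are all assumptions made for illustration. The sketch runs the forward-backward recursion in the log domain, forms the per-position Boltzmann expectation of the codeword, and compares it with a hard Viterbi decode at low and high temperature.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def bitshift_transitions(num_bits):
    """Toy trellis (assumed): shift the state left and append one new bit."""
    S = 1 << num_bits
    allowed = np.zeros((S, S), dtype=bool)
    for s in range(S):
        for b in (0, 1):
            allowed[s, ((s << 1) & (S - 1)) | b] = True
    return allowed


def soft_trellis_quantize(w, codebook, allowed, T):
    """Forward-backward (sum-product) pass at temperature T.

    Path energy (assumed): sum_t (w[t] - codebook[state_t])**2.
    Returns E[codebook[state_t]] under p(path) ~ exp(-energy / T).
    """
    n, S = len(w), len(codebook)
    log_emit = -((w[:, None] - codebook[None, :]) ** 2) / T  # (n, S)
    log_trans = np.where(allowed, 0.0, -np.inf)              # (prev, next)
    alpha = np.empty((n, S))
    beta = np.zeros((n, S))
    alpha[0] = log_emit[0]
    for t in range(1, n):  # forward messages
        alpha[t] = log_emit[t] + logsumexp(alpha[t - 1][:, None] + log_trans, axis=0)
    for t in range(n - 2, -1, -1):  # backward messages
        beta[t] = logsumexp(log_trans + (log_emit[t + 1] + beta[t + 1])[None, :], axis=1)
    log_post = alpha + beta
    log_post -= logsumexp(log_post, axis=1)[:, None]  # normalize state marginals
    return np.exp(log_post) @ codebook


def viterbi_quantize(w, codebook, allowed):
    """Hard decode: minimum-energy path on the same trellis (min-sum)."""
    n, S = len(w), len(codebook)
    cost = (w[:, None] - codebook[None, :]) ** 2
    trans = np.where(allowed, 0.0, np.inf)
    best = cost[0].copy()
    back = np.zeros((n, S), dtype=int)
    for t in range(1, n):
        cand = best[:, None] + trans  # (prev, next)
        back[t] = np.argmin(cand, axis=0)
        best = cost[t] + np.min(cand, axis=0)
    s = int(np.argmin(best))
    path = [s]
    for t in range(n - 1, 0, -1):  # follow back-pointers to recover the path
        s = int(back[t, s])
        path.append(s)
    return codebook[np.array(path[::-1])]


rng = np.random.default_rng(0)
codebook = rng.normal(size=8)      # hypothetical values, one per 3-bit state
allowed = bitshift_transitions(3)
w = rng.normal(size=8)             # toy "weights" to quantize

hard = viterbi_quantize(w, codebook, allowed)
soft_cold = soft_trellis_quantize(w, codebook, allowed, T=1e-6)
soft_hot = soft_trellis_quantize(w, codebook, allowed, T=1e9)
```

In this toy setting the low-temperature output should approach the Viterbi codeword, while at very high temperature every path is nearly equally likely and each soft value drifts toward the codebook mean. The NumPy version only computes the forward pass; writing the same recursion in an autodiff framework is what would make gradients with respect to `w` and `codebook` available.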
