TL;DR
RaBiT is a residual binarization framework for large language models that enforces a residual hierarchy across binary paths during quantization-aware training, improving both 2-bit accuracy and inference efficiency.
Contribution
It proposes a new method to prevent feature co-adaptation in binary paths by enforcing a residual hierarchy derived from shared weights.
Findings
RaBiT achieves state-of-the-art 2-bit LLM performance.
It delivers a 4.49x inference speed-up on an RTX 4090.
RaBiT rivals hardware-intensive vector quantization methods.
Abstract
Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary (±1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared…
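To make "stacking binary (±1) layers" concrete, the sketch below shows the textbook greedy form of residual binarization, in which each binary path approximates the error left by the previous ones. It is an illustration under stated assumptions, not RaBiT's training procedure: the function names `residual_binarize` and `reconstruct`, the per-row scale, and the random test matrix are all hypothetical choices made for this example.

```python
import numpy as np

def residual_binarize(weight, num_paths=2):
    """Greedy residual binarization: weight ~= sum_i alpha_i * B_i, with B_i in {-1, +1}.

    Assumption: one scale per output row, chosen as mean |residual|, which
    minimizes the squared error for a fixed sign matrix.
    """
    residual = weight.astype(np.float64)
    scales, signs = [], []
    for _ in range(num_paths):
        # Binary path: sign of the current residual (zeros map to +1).
        b = np.where(residual >= 0, 1.0, -1.0)
        # Least-squares scale for this path, one value per row.
        alpha = np.abs(residual).mean(axis=1, keepdims=True)
        scales.append(alpha)
        signs.append(b)
        # The next path only sees what this one failed to explain.
        residual = residual - alpha * b
    return scales, signs

def reconstruct(scales, signs):
    """Sum the scaled binary paths back into a dense approximation."""
    return sum(a * b for a, b in zip(scales, signs))

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))

errors = []
for k in (1, 2, 3):
    scales, signs = residual_binarize(w, num_paths=k)
    errors.append(float(np.mean((w - reconstruct(scales, signs)) ** 2)))

print(errors)  # reconstruction MSE for 1, 2, and 3 binary paths
```

In this greedy construction every added path reduces the reconstruction error, because each one fits only the remaining residual. The abstract's concern is that this error-compensation structure can degrade when the paths are trained jointly under QAT.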
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
I generally think this is a solid paper, with a clear presentation and a well-motivated method. The experiments show good quality, and are also complemented by real runtime measurements.
- Experiments limited to fine-tuning, which limits the impact of the paper.
- QTIP seems to outperform RaBiT in terms of accuracy/loss on larger models.
1. The paper identifies “inter-path adaptation” (multiple binary paths learn redundant features under shared gradients) as the key reason multi-binary LLMs underperform. This is a concrete, observable failure mode.
2. The hypothesis is empirically supported. Switching from the “Standard QAT” to the “Coupled / residual-aware” training already recovers a large portion of the quantization gap, and further initialization helps.
3. Simple inference-time form. After training, the model is reduced to a
1. On the larger models, RaBiT does not clearly win zero-shot across tasks. For models of that size, zero-shot should be the headline, not only commonsense-style scores. (There’s also a mis-bolded PIQA number for LLaMA-13B in the appendix.)
2. Results are single-run, no CIs, no multi-seed or alternative calibration subsets. So close numbers vs baselines are not conclusive and could flip with another seed.
3. No long-context or instruction-tuned/chat evaluations. Given the close zero-shot numbers
- The paper is well written and easy to follow.
- The paper tackles a hot and important topic: reducing the inference complexity of LLMs.
- The paper demonstrates strong accuracy and latency compared to the existing baseline, although it raises some concerns for the reviewer.
- The paper’s main analysis and motivation focus on the decomposition of the MSE loss presented on page three. However, the experimental results contradict this hypothesis. The proposed framework combines KL divergence with intermediate MSE losses, yet the contribution of this complex knowledge distillation (KD) setup is never ablated. Notably, the authors disable this component for the Gemma models ($\gamma=0$) to “avoid instability,” implying that the KD mechanism is sensitive and not universa
1. The theoretical analysis of coupled QAT is interesting, with adequate details and analysis. Relevant experiments are also conducted to validate the theory.
2. RaBiT exhibits impressive performance, hitting SOTA in 2-bit quantization with satisfactory speedup.
3. Sufficient details on training settings and kernel design are provided, which may benefit the community for future research.
1. This work’s primary contribution lies in the coupled QAT training framework; however, the residual binarization scheme and initialization method have been extensively explored in prior studies. This may constrain the novelty of the paper.
2. It remains unclear whether the compared methods were trained on identical datasets with the same number of iterations. Additionally, the optimizers employed in these baselines differ from Muon. Considering that Muon might result in better performance, it
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
