TL;DR
This paper introduces an output-aware EM initialisation method for extreme LLM quantization, significantly improving codebook optimisation and model performance at low bit precision.
Contribution
It identifies codebook initialisation as a key bottleneck and proposes OA-EM, a novel initialisation technique that enhances quantization quality across various models and compression rates.
Findings
OA-EM outperforms existing initialisation methods after PV-tuning.
Better initialisation leads to improved perplexity, especially at 2-bit precision.
The severity of initialisation issues scales with the representational ratio ρ.
Abstract
Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and…
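The quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes that N is the number of weight groups, K the number of codewords per codebook, and M the number of codebooks, that N/KM is read as N/(K·M), and that the Hessian-weighted Mahalanobis distance takes the quadratic form (w − c)ᵀ H (w − c) for some positive semi-definite matrix H. All function names here are hypothetical.

```python
import numpy as np

def dequantize(codes, codebooks):
    """Reconstruct weight groups as a sum of M codewords (additive quantization).

    codes:     (N, M) integer indices, one per codebook (assumed layout)
    codebooks: (M, K, g) array of K codewords of dimension g per codebook
    returns:   (N, g) reconstructed weight groups
    """
    n_codebooks = codebooks.shape[0]
    # One table lookup per codebook, then sum the selected codewords.
    return sum(codebooks[m][codes[:, m]] for m in range(n_codebooks))

def representational_ratio(n_groups, k, m):
    """rho = N / (K * M), under the reading of N/KM assumed above."""
    return n_groups / (k * m)

def hessian_weighted_sq_dist(w, c, hessian):
    """Squared distance (w - c)^T H (w - c); H is an assumed PSD weighting."""
    d = w - c
    return float(d @ hessian @ d)

# Toy example with made-up sizes: N=8 groups, M=2 codebooks, K=4, g=3.
rng = np.random.default_rng(0)
toy_codebooks = rng.normal(size=(2, 4, 3))
toy_codes = rng.integers(0, 4, size=(8, 2))
toy_weights = dequantize(toy_codes, toy_codebooks)
toy_rho = representational_ratio(n_groups=8, k=4, m=2)
```

With H set to the identity, the weighted distance reduces to the ordinary squared Euclidean distance, which is one way to see what the output-aware weighting adds on top of a plain k-means-style assignment.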
