Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification
Hong Huang, Decheng Wu, Qiangqiang Hu, Guanghua Yu, Jinhai Yang, Jianchen Zhu, Xue Liu, Dapeng Wu

TL;DR
Sherry introduces a hardware-efficient ternary quantization method that uses fine-grained sparsity to achieve 1.25-bit packing, maintaining model accuracy while reducing size and increasing inference speed on edge devices.
Contribution
Sherry proposes a novel 3:4 sparsity scheme with 1.25-bit packing and a training mechanism to prevent representational collapse, enabling efficient deployment of LLMs on resource-constrained hardware.
Findings
Achieves 25% bit savings over state-of-the-art methods.
Maintains zero accuracy loss on LLaMA-3.2 models.
Provides 10% faster inference on CPU.
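The 25% figure is consistent with simple arithmetic if the baseline is taken to be the 1.67-bit irregular packing named in the abstract, read here as three ternary weights stored in five bits. That reading of the baseline is an assumption and is not stated on this page.

```latex
\[
b_{\text{Sherry}} = \frac{5\ \text{bits}}{4\ \text{weights}} = 1.25,
\qquad
b_{\text{irregular}} = \frac{5\ \text{bits}}{3\ \text{weights}} \approx 1.67,
\qquad
1 - \frac{5/4}{5/3} = 1 - \frac{3}{4} = 25\%.
\]
```

Under the same arithmetic, the saving relative to 2-bit aligned packing would be 1 - 1.25/2 = 37.5%.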
Abstract
The deployment of Large Language Models (LLMs) on resource-constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to {-1, 0, +1}, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2-bit aligned packing, which incurs significant bit wastage, or 1.67-bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware-efficient ternary quantization framework. Sherry introduces a 3:4 fine-grained sparsity that achieves a regularized 1.25-bit width by packing blocks of four weights into five bits, restoring power-of-two alignment. Furthermore, we identify a weight trapping issue in sparse ternary training, which leads to representational collapse. To…
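One way to see why four weights can fit in five bits is a counting argument: if each block of four holds exactly one zero and three values in {-1, +1}, there are 4 zero positions times 2^3 sign patterns, or 32 = 2^5 possible blocks. The sketch below illustrates that idea only. Reading 3:4 sparsity as "three nonzeros and one zero per block" is an assumption, and the bit layout (two bits for the zero position, three bits for the signs) and the names `pack_block` and `unpack_block` are hypothetical; the abstract does not specify the paper's actual encoding.

```python
def pack_block(block):
    """Pack four ternary weights with exactly one zero into a 5-bit code.

    Assumed layout (illustrative, not taken from the paper):
    bits 4..3 hold the index of the zero, bits 2..0 hold the signs of
    the three nonzero weights in order (1 means +1, 0 means -1).
    """
    block = list(block)
    if len(block) != 4 or sorted(abs(w) for w in block) != [0, 1, 1, 1]:
        raise ValueError("expected four values in {-1, 0, +1} with exactly one zero")
    zero_pos = block.index(0)
    signs = 0
    for w in block:
        if w != 0:
            signs = (signs << 1) | (1 if w > 0 else 0)
    return (zero_pos << 3) | signs


def unpack_block(code):
    """Invert pack_block: recover the four ternary weights from a 5-bit code."""
    zero_pos = (code >> 3) & 0b11
    signs = code & 0b111
    block = []
    bit = 2  # sign bits are consumed from most significant to least
    for i in range(4):
        if i == zero_pos:
            block.append(0)
        else:
            block.append(1 if (signs >> bit) & 1 else -1)
            bit -= 1
    return block


if __name__ == "__main__":
    weights = [1, 0, -1, 1]
    code = pack_block(weights)
    print(f"{code:05b}", unpack_block(code))
```

Because every one of the 32 codes corresponds to a valid block under this assumption, no bit pattern is wasted, which is what distinguishes this from storing each ternary weight in two bits.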
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Natural Language Processing Techniques
