Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification

Hong Huang; Decheng Wu; Qiangqiang Hu; Guanghua Yu; Jinhai Yang; Jianchen Zhu; Xue Liu; Dapeng Wu

arXiv:2601.07892 · cs.LG · January 14, 2026

TL;DR

Sherry introduces a hardware-efficient ternary quantization method that uses fine-grained sparsity to achieve 1.25-bit packing, maintaining model accuracy while reducing size and increasing inference speed on edge devices.

Contribution

It proposes a novel 3:4 sparsity scheme with 1.25-bit packing and a training mechanism to prevent representational collapse, enabling efficient deployment of LLMs on resource-constrained hardware.

Findings

01. Achieves 25% bit savings over state-of-the-art methods.

02. Maintains zero accuracy loss on LLaMA-3.2 models.

03. Provides 10% faster inference on CPU.

Abstract

The deployment of Large Language Models (LLMs) on resource-constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to {-1, 0, +1}, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2-bit aligned packing, which incurs significant bit wastage, and 1.67-bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware-efficient ternary quantization framework. Sherry introduces a 3:4 fine-grained sparsity scheme that achieves a regular 1.25-bit width by packing blocks of four weights into five bits, restoring power-of-two alignment. Furthermore, we identify a weight trapping issue in sparse ternary training, which leads to representational collapse. To…
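The abstract is cut off above and does not give the bit layout, so the following is only a minimal sketch of how four weights could fit in five bits. It assumes that "3:4" means each block of four holds three nonzero weights and exactly one zero. Under that reading there are 4 zero positions × 2^3 sign patterns = 32 valid blocks, which is exactly 2^5, so 5 bits / 4 weights = 1.25 bits per weight. The names `pack_block` and `unpack_block`, and the layout itself (2 bits for the zero position, 3 bits for the signs), are hypothetical and may not match the paper's encoding.

```python
from itertools import product


def pack_block(block):
    """Pack four ternary weights with exactly one zero into a 5-bit code (0..31).

    Assumed layout (illustrative, not from the paper):
      bits 4-3: index of the single zero weight
      bits 2-0: sign bits of the three nonzero weights, in order (1 means -1)
    """
    if len(block) != 4 or any(w not in (-1, 0, 1) for w in block):
        raise ValueError("expected four weights in {-1, 0, +1}")
    if sum(1 for w in block if w == 0) != 1:
        raise ValueError("expected exactly one zero per block")
    zero_pos = list(block).index(0)
    signs = 0
    for w in block:
        if w != 0:
            signs = (signs << 1) | (1 if w < 0 else 0)
    return (zero_pos << 3) | signs


def unpack_block(code):
    """Invert pack_block: recover the four ternary weights from a 5-bit code."""
    if not 0 <= code < 32:
        raise ValueError("code must fit in five bits")
    zero_pos = code >> 3
    signs = code & 0b111
    block = []
    bit = 2  # sign bits are consumed from most to least significant
    for i in range(4):
        if i == zero_pos:
            block.append(0)
        else:
            block.append(-1 if (signs >> bit) & 1 else 1)
            bit -= 1
    return block


# Every block with exactly one zero: 4 positions x 8 sign patterns.
valid_blocks = [list(b) for b in product((-1, 0, 1), repeat=4) if b.count(0) == 1]
codes = sorted(pack_block(b) for b in valid_blocks)
```

In this sketch `valid_blocks` has 32 entries and `codes` covers every value from 0 to 31 exactly once, so no five-bit pattern is wasted. By comparison, 1.67-bit packing corresponds to three weights in five bits (5/3), and 2-bit packing to one weight per two bits.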

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Natural Language Processing Techniques