UltraSketchLLM: Saliency-Driven Sketching for Ultra-Low Bit LLM Compression
Sunan Zou, Ziyun Zhang, Xueting Sun, Guojie Luo

TL;DR
UltraSketchLLM is a sketch-based compression method that reduces LLM weights to as little as 0.5 bits per weight while preserving model performance, enabling deployment on edge devices with limited memory.
Contribution
It proposes an index-free, sketch-based framework using data sketching techniques for ultra-low bit LLM compression, surpassing previous methods in efficiency and accuracy.
Findings
Achieves compression down to 0.5 bits per weight on Llama-3.2-1B.
Maintains competitive perplexity with minimal latency overhead.
Provides a practical solution for resource-constrained LLM deployment.
Abstract
The rapid growth of large language models (LLMs) has outpaced the memory constraints of edge devices, necessitating extreme weight compression beyond the 1-bit limit. While quantization reduces model size, it is fundamentally limited to 1 bit per weight. Existing multiple-to-one compression methods either rely on mapping tables (inducing memory overhead) or incur severe accuracy degradation due to random weight grouping. We introduce UltraSketchLLM, an index-free, sketch-based framework that achieves ultra-low bit compression (down to 0.5 bits per weight) while preserving model performance. UltraSketchLLM leverages data sketching, a sub-linear representation technique from streaming applications, to map multiple weights to single values with bounded error. Our approach integrates an underestimate AbsMaxMin sketch to minimize relative errors for small weights, importance-aware space…
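To make the "multiple weights to single values" idea concrete, here is a minimal Python sketch of an index-free, underestimating magnitude sketch. It is an illustration under stated assumptions, not the paper's implementation: the hash family, the function names (`bucket_ids`, `compress`, `estimate_magnitudes`), the 8-bit bucket width, and the reading of "AbsMaxMin" as "minimum magnitude per bucket, maximum across rows" are all assumptions. Weight signs and the importance-aware allocation mentioned in the abstract are not modeled.

```python
import numpy as np

P = 2_147_483_647  # a Mersenne prime used by the (assumed) multiply-add hash

def bucket_ids(n_weights, n_buckets, n_rows, seed=0):
    """Recompute bucket ids from a seed, so no mapping table is stored."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, P, size=(n_rows, 1), dtype=np.int64)
    b = rng.integers(0, P, size=(n_rows, 1), dtype=np.int64)
    idx = np.arange(n_weights, dtype=np.int64)[None, :]
    return ((a * idx + b) % P) % n_buckets  # shape: (n_rows, n_weights)

def compress(weights, n_buckets, n_rows=2, seed=0):
    """Each bucket keeps the smallest magnitude hashed into it (an underestimate)."""
    mags = np.abs(weights)
    ids = bucket_ids(weights.size, n_buckets, n_rows, seed)
    table = np.full((n_rows, n_buckets), np.inf)
    for r in range(n_rows):
        np.minimum.at(table[r], ids[r], mags)  # in-place per-bucket minimum
    table[np.isinf(table)] = 0.0  # buckets that received no weight
    return table

def estimate_magnitudes(table, n_weights, seed=0):
    """Take the largest per-row value: the tightest of the underestimates."""
    n_rows, n_buckets = table.shape
    ids = bucket_ids(n_weights, n_buckets, n_rows, seed)
    return np.max(table[np.arange(n_rows)[:, None], ids], axis=0)

# Hypothetical sizes: 4096 weights, 2 rows of 128 buckets, 8 bits per bucket.
n_weights, n_rows, n_buckets, bits_per_bucket = 4096, 2, 128, 8
weights = np.random.default_rng(1).normal(size=n_weights)
table = compress(weights, n_buckets, n_rows)
estimates = estimate_magnitudes(table, n_weights)
bits_per_weight = n_rows * n_buckets * bits_per_bucket / n_weights  # 0.5
```

Because every bucket value is the minimum magnitude among the weights that hash to it, each estimate is never larger than the true magnitude. The storage arithmetic shows how the budget drops below 1 bit per weight: 2 x 128 buckets x 8 bits shared by 4096 weights is 0.5 bits per weight.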
