Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method

Qingcheng Zhu; Yangyang Ren; Linlin Yang; Mingbao Lin; Yanjing Li; Sheng Xu; Zichao Feng; Haodong Zhu; Yuguang Yang; Juan Zhang; Runqi Wang; Baochang Zhang

arXiv:2507.18073·cs.LG·July 25, 2025

TL;DR

Squeeze10-LLM is a staged mixed-precision post-training quantization framework that compresses 16-bit LLM weights by about 10 times, to an average of 1.6 bits per weight. Two techniques, PBAR and FIAS, are introduced to preserve accuracy at this ultra-low bit-width.

Contribution

The paper presents a post-training quantization method that reaches an average of 1.6 bits per weight and improves low-bit LLM accuracy through two key innovations, PBAR and FIAS.

Findings

01

Achieves state-of-the-art sub-2bit quantization performance on LLaMA models.

02

Improves zero-shot classification accuracy from 43% to 56%.

03

Quantizes 80% of weights to 1 bit and 20% to 4 bits.
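
The 1.6-bit average and the 10x figure follow directly from the 80/20 split. The short Python check below is only a sketch of that arithmetic: the helper name `average_bits` is hypothetical, and the calculation assumes a 16-bit baseline and ignores any storage overhead for scales or group metadata, which this page does not describe.

```python
def average_bits(groups):
    """Weighted mean bit-width for (fraction_of_weights, bits) pairs."""
    total_fraction = sum(fraction for fraction, _ in groups)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in groups)


# 80% of weights at 1 bit, 20% at 4 bits (figures from the findings above).
avg = average_bits([(0.8, 1), (0.2, 4)])

# Compression relative to a 16-bit baseline (overhead ignored: an assumption).
ratio = 16 / avg

print(f"average bits per weight: {avg:.2f}")
print(f"compression vs. 16-bit:  {ratio:.1f}x")
```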

Abstract

Deploying large language models (LLMs) is challenging due to their massive parameters and high computational costs. Ultra low-bit quantization can significantly reduce storage and accelerate inference, but extreme compression (i.e., mean bit-width <= 2) often leads to severe performance degradation. To address this, we propose Squeeze10-LLM, effectively "squeezing" 16-bit LLMs' weights by 10 times. Specifically, Squeeze10-LLM is a staged mixed-precision post-training quantization (PTQ) framework and achieves an average of 1.6 bits per weight by quantizing 80% of the weights to 1 bit and 20% to 4 bits. We introduce Squeeze10-LLM with two key innovations: Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS). PBAR is a refined weight significance metric that accounts for the impact of quantization on activations, improving accuracy in low-bit…
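
To make the 80%/20% split concrete, here is a minimal pure-Python sketch of mixed-precision weight quantization. It is not the paper's method: plain weight magnitude stands in for the PBAR significance metric, sign-times-mean-magnitude binarization and min-max uniform 4-bit quantization are generic choices assumed for illustration, and FIAS and the staged procedure are not modeled. The function name `quantize_mixed` and the sample weights are hypothetical.

```python
def quantize_mixed(weights, high_fraction=0.2, high_bits=4):
    """Keep the most significant weights at `high_bits`, binarize the rest."""
    n = len(weights)
    n_high = round(n * high_fraction)

    # Stand-in significance score: |w| (an assumption, NOT the PBAR metric).
    order = sorted(range(n), key=lambda i: abs(weights[i]), reverse=True)
    high_idx = set(order[:n_high])
    low_idx = [i for i in range(n) if i not in high_idx]

    out = [0.0] * n

    # 1-bit group: each weight becomes sign(w) * mean(|w|) over the group.
    if low_idx:
        alpha = sum(abs(weights[i]) for i in low_idx) / len(low_idx)
        for i in low_idx:
            out[i] = alpha if weights[i] >= 0 else -alpha

    # High-precision group: uniform min-max quantization with 2**bits levels.
    if high_idx:
        values = [weights[i] for i in high_idx]
        lo, hi = min(values), max(values)
        steps = 2 ** high_bits - 1
        step = (hi - lo) / steps if hi > lo else 1.0
        for i in high_idx:
            level = round((weights[i] - lo) / step)
            out[i] = lo + level * step

    return out, sorted(high_idx)


# Hypothetical sample weights, for illustration only.
sample = [0.1, -0.2, 0.05, 1.5, -0.3, 0.02, -1.2, 0.15, -0.08, 0.25]
dequantized, kept_high = quantize_mixed(sample)
print("indices kept at 4 bits:", kept_high)
print("dequantized weights:", [round(v, 4) for v in dequantized])
```

With ten weights, two are kept at 4 bits and the other eight collapse to a single magnitude with their original signs, which is where most of the storage saving comes from.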

Taxonomy

Topics: Manufacturing Process and Optimization · Industrial Vision Systems and Defect Detection · Metallurgy and Material Forming