EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

Shu-Hao Zhang; Le-Tong Huang; Xiang-Sheng Deng; Xin-Yi Zou; Chen Wu; Nan Li; Shao-Qun Zhang; Zhi-Hua Zhou

arXiv:2605.04062·cs.LG·May 22, 2026

TL;DR

EdgeRazor is a lightweight framework that combines mixed-precision quantization with quantization-aware distillation to compress large language models below 4-bit precision with minimal performance loss.

Contribution

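It introduces a three-module framework for mixed-precision quantization-aware distillation: Structural Quantization with Mixed Precision, Layer-Adaptive Feature Distillation, and Entropy-Aware KL Divergence. The paper reports better compression and efficiency than existing low-bit methods.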

Findings

01. Under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms state-of-the-art 2-bit baselines by 11.27% in accuracy.

02. Reduces storage from 1.11 GB to 0.19 GB at 1.58-bit precision.

03. Achieves 15.16× faster decoding than the 16-bit baseline.
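
The fractional bit-widths quoted here (1.58-bit, 1.88-bit) can be read as average bits per weight. The sketch below is illustrative arithmetic only, not the paper's accounting. It assumes that "1.58-bit" follows the common convention of ternary weights carrying log2(3) bits each, and that a mixed-precision model's bit-width is a parameter-weighted mean. The `average_bits` helper and the example split are hypothetical.

```python
import math

# Assumption: "1.58-bit" denotes ternary weights {-1, 0, +1},
# which carry log2(3) bits of information each.
TERNARY_BITS = math.log2(3)

def average_bits(groups):
    """Parameter-weighted mean bit-width over (num_params, bits) groups.

    Hypothetical helper for illustration; the page does not describe
    how EdgeRazor assigns or reports bit-widths per layer.
    """
    total = sum(n for n, _ in groups)
    return sum(n * b for n, b in groups) / total

# Ratio between the two storage figures quoted in the findings.
compression_ratio = 1.11 / 0.19

# Made-up split: three quarters of the weights ternary, one quarter at 3 bits.
example_mix = average_bits([(3, TERNARY_BITS), (1, 3.0)])

print(f"ternary bits per weight: {TERNARY_BITS:.3f}")
print(f"storage ratio 1.11 GB / 0.19 GB: {compression_ratio:.2f}x")
print(f"average bits for the made-up split: {example_mix:.2f}")
```

The page does not say which configuration the 1.11 GB figure refers to, so the storage ratio should not be read as a statement about any particular baseline precision.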

Abstract

Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on MobileLLM and Qwen families show that under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 and…
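
The abstract names an Entropy-Aware KL Divergence that balances forward and reverse KL, but this page does not give the formula. The following is a minimal sketch of one way such a balance could look, not the authors' loss. It assumes the mixing weight `alpha` is the teacher's normalized entropy. That weighting rule, the choice of which direction it favors, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def normalized_entropy(p, eps=1e-12):
    """Shannon entropy of p divided by log(len(p)), so it lies in [0, 1]."""
    h = -sum(pi * math.log(pi + eps) for pi in p if pi > 0)
    return h / math.log(len(p))

def blended_kl(teacher_logits, student_logits):
    """Mix forward and reverse KL with an entropy-derived weight.

    Assumption (not from the paper): a high-entropy teacher leans on
    forward KL(teacher || student), a confident teacher leans on
    reverse KL(student || teacher).
    """
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    alpha = normalized_entropy(p_t)  # hypothetical weighting rule
    forward = kl(p_t, p_s)
    reverse = kl(p_s, p_t)
    return alpha * forward + (1.0 - alpha) * reverse

loss = blended_kl([2.0, 0.0, -1.0], [0.5, 0.2, 0.1])
print(f"blended KL: {loss:.4f}")
```

Forward KL tends to make the student cover every mode of the teacher, while reverse KL tends to concentrate it on the teacher's dominant modes. Any scheme that interpolates between the two trades off these behaviors.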

Code & Models

Repositories

zhangsq-nju/EdgeRazor (GitHub)
