Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels
Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu

TL;DR
This paper introduces a layer-wise quantization method for large language models that assigns different bit levels to layers based on their importance, achieving significant compression with minimal performance loss.
Contribution
The paper proposes a novel importance-based layer-wise quantization strategy that is independent of the underlying quantization technique and improves model compression efficiency.
Findings
Layer importance can be effectively measured by output-input embedding differences.
Quantizing layers according to their importance scores maintains performance when 25-50% of the layers are moved to a lower bit level.
Layer-wise quantization outperforms pruning unless extreme 2-bit quantization is used.
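The idea of moving a fraction of layers to a lower bit level can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 4-bit and 2-bit defaults, and the example scores are all assumptions, and the average is only meaningful if every layer holds the same number of weights.

```python
def assign_bits(importance, frac_low, high_bits=4, low_bits=2):
    """Give the least important layers low_bits and all others high_bits.

    importance: one score per layer, where a larger value means more important.
    frac_low:   fraction of layers to move to the lower bit level.
    """
    n = len(importance)
    n_low = round(frac_low * n)
    # Layer indices ordered from least to most important.
    order = sorted(range(n), key=lambda i: importance[i])
    bits = [high_bits] * n
    for i in order[:n_low]:
        bits[i] = low_bits
    return bits


def average_bits(bits):
    """Mean bit level across layers (assumes equally sized layers)."""
    return sum(bits) / len(bits)


# Hypothetical importance scores for a four-layer model.
scores = [0.9, 0.1, 0.4, 0.7]
bits = assign_bits(scores, frac_low=0.5)
print(bits, average_bits(bits))
```

Mixing two integer bit levels across layers is what yields a non-integer average bit level for the model as a whole, which is the sense of "beyond integer bit-levels" in the title.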
Abstract
We present a simple meta quantization approach that quantizes different layers of a large language model (LLM) at different bit levels, and is independent of the underlying quantization technique. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (higher is better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (smaller is better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from…
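The two importance measures described in the abstract might look roughly like the sketch below. The abstract does not give exact formulas, so both definitions are assumptions made for illustration: the first uses one minus the cosine similarity between a layer's input and output vectors, and the second counts the fraction of weights whose z-score exceeds a threshold (`z_threshold`, a hypothetical parameter defaulting to 1.0).

```python
import math


def embedding_change_score(layer_input, layer_output):
    """Assumed measure 1: 1 - cosine similarity of input and output vectors.

    A higher value means the layer changes its input more (higher is better).
    """
    dot = sum(a * b for a, b in zip(layer_input, layer_output))
    norm_in = math.sqrt(sum(a * a for a in layer_input))
    norm_out = math.sqrt(sum(b * b for b in layer_output))
    return 1.0 - dot / (norm_in * norm_out)


def outlier_weight_fraction(weights, z_threshold=1.0):
    """Assumed measure 2: fraction of weights with z-score above z_threshold.

    Per the abstract, a smaller value is better for this measure.
    """
    mean = sum(weights) / len(weights)
    std = math.sqrt(sum((w - mean) ** 2 for w in weights) / len(weights))
    n_large = sum(1 for w in weights if (w - mean) / std > z_threshold)
    return n_large / len(weights)


# Toy vectors: an unchanged embedding versus an orthogonal one.
print(embedding_change_score([1.0, 0.0], [1.0, 0.0]))
print(embedding_change_score([1.0, 0.0], [0.0, 1.0]))
print(outlier_weight_fraction([0.0, 0.0, 0.0, 4.0]))
```

In practice the first score would be averaged over many tokens from a calibration set, while the second needs only the weights themselves; both details are outside what the abstract states.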
Code & Models
- 🤗 eaddario/Hammer2.1-7b-GGUF · model · 360 downloads · 2 likes
- 🤗 eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF · model · 2.1k downloads · 3 likes
- 🤗 eaddario/Watt-Tool-8B-GGUF · model · 562 downloads · 5 likes
- 🤗 eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF · model · 324 downloads · 1 like
- 🤗 eaddario/Dolphin3.0-R1-Mistral-24B-GGUF · model · 351 downloads · 1 like
- 🤗 eaddario/Llama-Guard-3-8B-GGUF · model · 489 downloads
- 🤗 eaddario/Dolphin3.0-Mistral-24B-GGUF · model · 223 downloads · 2 likes
- 🤗 eaddario/Llama-xLAM-2-8b-fc-r-GGUF · model · 92 downloads · 1 like
- 🤗 eaddario/Qwen3-8B-GGUF · model · 203 downloads · 1 like
- 🤗 eaddario/OLMo-2-1124-7B-Instruct-GGUF · model · 53 downloads
Taxonomy
Topics: Advancements in Photolithography Techniques · Semiconductor materials and devices · Advancements in Semiconductor Devices and Circuit Design
Methods: Pruning
