Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels

Razvan-Gabriel Dumitru; Vikas Yadav; Rishabh Maheshwary; Paul-Ioan Clotan; Sathwik Tejaswi Madhusudhan; Mihai Surdeanu

arXiv:2406.17415 · cs.CL · October 29, 2024

TL;DR

This paper introduces a layer-wise quantization method for large language models that assigns different bit levels to layers based on their importance, achieving significant compression with minimal performance loss.

Contribution

The paper proposes a novel importance-based layer-wise quantization strategy that is independent of the underlying quantization technique and improves model compression efficiency.

Findings

01

Layer importance can be measured effectively by how much a layer's output embeddings differ from its input embeddings.

02

Quantizing layers according to their importance scores maintains performance when 25-50% of the layers are moved to lower bit levels.

03

Layer-wise quantization outperforms pruning unless extreme 2-bit quantization is used.
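
As a rough illustration of the second finding, and of the "beyond integer bit-levels" idea in the title, the sketch below ranks layers by an importance score, moves the least important fraction to a lower bit level, and reports the resulting average bits per layer. The function names, the example scores, and the 4-bit/2-bit choice are hypothetical, and the average assumes every layer has the same number of parameters. It is a minimal sketch, not the authors' implementation.

```python
def assign_bits(importance, frac_low, high_bits=4, low_bits=2):
    """Give the least important layers `low_bits` and the rest `high_bits`.

    `importance` holds one score per layer (higher = more important).
    `frac_low` is the fraction of layers moved to the lower bit level.
    All names and defaults here are illustrative assumptions.
    """
    n = len(importance)
    n_low = round(frac_low * n)
    # Layer indices sorted from least to most important.
    order = sorted(range(n), key=lambda i: importance[i])
    bits = [high_bits] * n
    for i in order[:n_low]:
        bits[i] = low_bits
    return bits


def average_bits(bits):
    """Mean bit level across layers, assuming equally sized layers."""
    return sum(bits) / len(bits)


# Made-up importance scores for an 8-layer model.
scores = [0.9, 0.2, 0.5, 0.1, 0.7, 0.3, 0.8, 0.4]
bits = assign_bits(scores, frac_low=0.25)
print(bits)                # [4, 2, 4, 2, 4, 4, 4, 4]
print(average_bits(bits))  # 3.5
```

With 25% of the layers at 2 bits and the rest at 4 bits, the model averages 3.5 bits per layer, a non-integer level that a single uniform bit width cannot reach.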

Abstract

We present a simple meta quantization approach that quantizes different layers of a large language model (LLM) at different bit levels, and is independent of the underlying quantization technique. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (higher is better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (smaller is better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from…
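
The two importance measures described in the abstract can be sketched roughly as follows. The abstract does not give exact formulas, so both definitions here are assumptions: the first expresses "how different" the output embedding is from the input as one minus their cosine similarity, and the second counts the fraction of weights whose z-score exceeds a threshold (the value 1.0 is a placeholder). Function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def embedding_change(layer_input, layer_output):
    """Assumed first score: 1 - cosine similarity of input and output.

    Larger values mean the layer changes its input more, which the
    abstract treats as a sign of a more important layer.
    """
    dot = sum(a * b for a, b in zip(layer_input, layer_output))
    norm_in = math.sqrt(sum(a * a for a in layer_input))
    norm_out = math.sqrt(sum(b * b for b in layer_output))
    return 1.0 - dot / (norm_in * norm_out)


def outlier_fraction(weights, z_threshold=1.0):
    """Assumed second score: share of weights far above the mean.

    A weight counts as "much larger than average" when its z-score
    exceeds `z_threshold`. Per the abstract, smaller is better.
    """
    mu = mean(weights)
    sigma = pstdev(weights)
    above = sum(1 for w in weights if (w - mu) / sigma > z_threshold)
    return above / len(weights)


# Toy vectors: identical input/output vs. orthogonal input/output.
print(embedding_change([1.0, 0.0], [1.0, 0.0]))       # 0.0
print(embedding_change([1.0, 0.0], [0.0, 1.0]))       # 1.0
print(outlier_fraction([0.0, 0.0, 0.0, 0.0, 10.0]))   # 0.2
```

In practice the input and output embeddings would come from running calibration text through the model and the weights from each layer's parameter tensors; the toy lists above only show the arithmetic.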

Code & Models

Repositories

razvandu/layerwisequant (Official)

Taxonomy

Topics: Advancements in Photolithography Techniques · Semiconductor materials and devices · Advancements in Semiconductor Devices and Circuit Design

Methods: Pruning