LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment

Binrui Zeng; Bin Ji; Xiaodong Liu; Jie Yu; Shasha Li; Jun Ma; Xiaopeng Li; Shangwen Wang; Xinran Hong; Yongtao Tang

arXiv:2412.18135·cs.CL·May 7, 2025

Open Access

TL;DR

LSAQ introduces a layer-specific adaptive quantization method that dynamically adjusts precision based on layer importance, enabling efficient deployment of large language models on resource-constrained edge devices.

Contribution

The paper presents a novel adaptive quantization system that evaluates layer importance and adjusts quantization strategies in real time for LLM deployment on edge devices.

Findings

01 Outperforms baseline quantization methods in perplexity and zero-shot tasks

02 Adapts quantization schemes for different deployment scenarios

03 Enables efficient LLM deployment on resource-limited devices

Abstract

As Large Language Models (LLMs) demonstrate exceptional performance across various domains, deploying LLMs on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory requirements of LLMs, are effective for deploying LLMs on resource-limited edge devices. However, existing one-size-fits-all quantization methods often fail to dynamically adjust the memory requirements of LLMs, limiting their applications to practical edge devices with various computation resources. To tackle this issue, we propose Layer-Specific Adaptive Quantization (LSAQ), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance. Specifically, LSAQ evaluates the importance of LLMs' neural layers by constructing top-k token sets from the inputs and outputs of each layer and calculating their Jaccard similarity. Based on layer importance, our…
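The layer-importance step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each layer's input and output have already been mapped to per-token vocabulary scores (the abstract does not say how), and the function names `top_k_token_ids`, `jaccard`, and `layer_similarity` are hypothetical.

```python
from typing import Sequence, Set


def top_k_token_ids(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices (token ids) of the k largest vocabulary scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layer_similarity(
    input_scores: Sequence[float], output_scores: Sequence[float], k: int
) -> float:
    """Jaccard similarity between the top-k token sets of a layer's input and output.

    ASSUMPTION: the scores are vocabulary-sized vectors derived from the
    layer's input and output hidden states; how they are derived is not
    specified in the abstract.
    """
    return jaccard(
        top_k_token_ids(input_scores, k), top_k_token_ids(output_scores, k)
    )


# Toy example with a 5-token vocabulary and k = 3.
layer_in = [0.1, 0.9, 0.8, 0.2, 0.7]   # top-3 ids: {1, 2, 4}
layer_out = [0.9, 0.8, 0.1, 0.2, 0.7]  # top-3 ids: {0, 1, 4}
similarity = layer_similarity(layer_in, layer_out, k=3)
print(similarity)  # 2 shared ids out of 4 distinct ids -> 0.5
```

One plausible reading, stated here as an assumption, is that a layer whose output top-k set differs strongly from its input set (low similarity) changes the representation more and would therefore be treated as more important when assigning precision.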

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques