LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment
Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong, Yongtao Tang

TL;DR
LSAQ introduces a layer-specific adaptive quantization method that dynamically adjusts precision based on layer importance, enabling efficient deployment of large language models on resource-constrained edge devices.
Contribution
The paper presents a novel adaptive quantization system that evaluates layer importance and adjusts quantization strategies in real time for LLM deployment on edge devices.
Findings
Outperforms baseline quantization methods in perplexity and zero-shot tasks
Adapts quantization schemes for different deployment scenarios
Enables efficient LLM deployment on resource-limited devices
Abstract
As Large Language Models (LLMs) demonstrate exceptional performance across various domains, deploying LLMs on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory requirements of LLMs, are effective for deploying LLMs on resource-limited edge devices. However, existing one-size-fits-all quantization methods often fail to dynamically adjust the memory requirements of LLMs, limiting their applications to practical edge devices with various computation resources. To tackle this issue, we propose Layer-Specific Adaptive Quantization (LSAQ), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance. Specifically, LSAQ evaluates the importance of LLMs' neural layers by constructing top-k token sets from the inputs and outputs of each layer and calculating their Jaccard similarity. Based on layer importance, our…
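The layer-importance step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each layer's input and output have already been mapped to per-token scores over the vocabulary (how LSAQ does this is not specified in the text above), and it assumes that a lower Jaccard similarity between the two top-k sets indicates a more important layer. The function names `top_k_token_ids`, `jaccard`, and `layer_importance` are hypothetical.

```python
from typing import Sequence, Set


def top_k_token_ids(scores: Sequence[float], k: int) -> Set[int]:
    """Return the vocabulary indices of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def layer_importance(input_scores: Sequence[float],
                     output_scores: Sequence[float],
                     k: int) -> float:
    """Assumed importance score: 1 - Jaccard similarity of the top-k sets.

    Assumption (not stated in the abstract): a layer that changes the
    top-k token set more (lower similarity) is treated as more important.
    """
    before = top_k_token_ids(input_scores, k)
    after = top_k_token_ids(output_scores, k)
    return 1.0 - jaccard(before, after)


# Toy scores over a 5-token vocabulary (illustrative values only).
layer_in = [0.1, 0.9, 0.8, 0.2, 0.7]   # top-3 set: {1, 2, 4}
layer_out = [0.9, 0.8, 0.1, 0.2, 0.7]  # top-3 set: {0, 1, 4}
print(layer_importance(layer_in, layer_out, k=3))  # 1 - 2/4 = 0.5
```

In this toy case the two top-3 sets share two of four distinct tokens, so the similarity is 0.5. A layer whose output top-k set matched its input exactly would score 0 under this assumed convention.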
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
