On the Compressibility of Quantized Large Language Models

Yu Mao, Weilan Wang, Hongchao Du, Nan Guan, and Chun Jason Xue

arXiv:2403.01384 · cs.LG · May 7, 2024 · 1 citation

Open Access

TL;DR

This paper investigates how data compression can reduce data movement and speed up inference of quantized large language models on memory-limited devices, where loading model weights from storage becomes an I/O bottleneck.

Contribution

It provides a preliminary analysis of the compressibility of quantized LLMs and explores trade-offs between compression and model performance.

Findings

01 Quantized LLMs can be further compressed to reduce data transfer.

02 Compression impacts the performance and efficiency of LLM inference.

03 Compression and model accuracy can potentially be optimized jointly.

Abstract

Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory requirements of LLMs. Quantization is an effective way of reducing model size while maintaining good performance. Even after quantization, however, LLMs may still be too large to fit entirely into the limited memory of edge or mobile devices and must be partially loaded from storage to complete inference. In this case, the I/O latency of model loading becomes the bottleneck of LLM inference latency. In this work, we take a preliminary step toward studying the application of data compression techniques to reduce data movement and thus speed up the inference of quantized LLMs on memory-constrained devices. In particular, we discuss the compressibility of…
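To make the idea in the abstract concrete, the following is a minimal sketch of how one might measure the lossless compressibility of quantized weights. It is not the authors' method: the Gaussian synthetic weights, the symmetric per-tensor int8 scheme in `quantize_int8`, and the use of `zlib` as the compressor are all assumptions chosen for illustration, and real model weights may compress quite differently.

```python
import random
import zlib


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (an illustrative assumption)."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    # Store each signed value in [-127, 127] as one two's-complement byte.
    packed = bytes(round(w / scale) & 0xFF for w in weights)
    return packed, scale


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw, level))


# Synthetic stand-in for one weight tensor (hypothetical data, not a real model).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]

quantized, scale = quantize_int8(weights)
ratio = compression_ratio(quantized)

# Lossless round trip: decompression restores the exact quantized bytes,
# so the model's outputs are unaffected by this step.
restored = zlib.decompress(zlib.compress(quantized))

print(f"quantized size: {len(quantized)} bytes")
print(f"compression ratio: {ratio:.3f}")
```

A ratio above 1 means fewer bytes need to be read from storage for the same tensor. Whether that shortens end-to-end latency depends on decompression cost relative to storage bandwidth, which this sketch does not measure.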

Taxonomy

Topics: Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings