When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

Weilan Wang; Yu Mao; Dongdong Tang; Hongchao Du; Nan Guan; Chun Jason Xue

arXiv:2502.15443·cs.CL·February 24, 2025

TL;DR

This paper presents a novel double compression framework for large language models that combines quantization and pruning to significantly reduce memory usage with minimal impact on performance.

Contribution

The paper introduces a compression-aware quantization method and a speed-adaptive decompression method, achieving a compression ratio of about 2.2x and a 40% reduction in memory size for LLMs.

Findings

01. Achieves a 2.2x compression ratio with negligible accuracy loss.

02. Reduces memory size by 40% during inference.

03. Provides a trade-off analysis between memory and latency.

Abstract

Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying them on memory-limited devices, even for quantized LLMs. This paper introduces a framework to further compress LLMs after quantization, achieving a compression ratio of about 2.2x. A compression-aware quantization is first proposed to enhance the compressibility of model weights by re-scaling the model parameters before quantization, followed by a pruning method to improve compressibility further. Building on this, we notice that decompression can become a bottleneck in practical scenarios. We then give a detailed analysis of the trade-off between memory usage and latency introduced by the proposed method, and propose a speed-adaptive method to overcome this bottleneck. The experimental results show inference with the compressed model can achieve a 40% reduction in memory size with…
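The general idea of compressing weights a second time after quantization can be illustrated with a small sketch. This is not the authors' method: it uses synthetic Gaussian weights, a plain symmetric int8 quantizer, the standard-library `zlib` as a stand-in lossless compressor, and a hypothetical pruning threshold. The paper's compression-aware re-scaling step is not reproduced here because the abstract does not specify it. The sketch only shows that quantized weights can still shrink under lossless compression, and that zeroing small values tends to make them shrink further.

```python
import random
import zlib


def quantize_int8(weights, scale):
    """Symmetric round-to-nearest quantization, clamped to [-127, 127]."""
    return [max(-127, min(127, round(w / scale))) for w in weights]


def prune_small(q_weights, threshold):
    """Set quantized values with magnitude below `threshold` to zero."""
    return [0 if abs(q) < threshold else q for q in q_weights]


def to_bytes(q_weights):
    """Pack signed int8 values as two's-complement bytes."""
    return bytes(q & 0xFF for q in q_weights)


def compression_ratio(q_weights):
    """Raw int8 size divided by zlib-compressed size."""
    raw = to_bytes(q_weights)
    return len(raw) / len(zlib.compress(raw, 9))


random.seed(0)
# Synthetic, roughly bell-shaped "weights" (assumption); real LLM weights differ.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
scale = max(abs(w) for w in weights) / 127  # hypothetical per-tensor scale

dense = quantize_int8(weights, scale)
pruned = prune_small(dense, threshold=16)  # hypothetical threshold

print(f"quantized only:     {compression_ratio(dense):.2f}x")
print(f"quantized + pruned: {compression_ratio(pruned):.2f}x")
```

The ratios printed here depend entirely on the synthetic data and the chosen threshold, so they should not be compared with the 2.2x figure reported in the paper. The sketch also ignores the decompression cost at inference time, which is the bottleneck the abstract's speed-adaptive method targets.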

Taxonomy

Methods: Pruning