Palu: Compressing KV-Cache with Low-Rank Projection

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, and Kai-Chiang Wu

arXiv:2407.21118 · cs.AI · November 5, 2024

TL;DR

Palu is a low-rank projection framework that compresses the KV-Cache in large language models. It reports a 50% reduction in KV-Cache size and faster inference while maintaining accuracy, using a medium-grained low-rank decomposition, an efficient rank search algorithm, and optimized GPU kernels.

Contribution

Palu compresses the KV-Cache along the hidden dimension using low-rank projection, targeting redundancy that token-sampling and quantization methods do not exploit.

Findings

01. Compresses KV-Cache by 50% while maintaining accuracy.

02. Achieves up to 1.89x speedup in RoPE-based attention.

03. Combined with quantization, yields up to 2.91x memory savings.

Abstract

Post-training KV-Cache compression methods typically either sample a subset of effectual tokens or quantize the data to a lower bit width. However, these methods cannot exploit redundancy in the hidden dimension of the KV tensors. This paper presents Palu, a KV-Cache compression framework that uses low-rank projection along the hidden dimension to reduce inference-time LLM memory usage. Palu decomposes the linear layers into low-rank matrices, caches the compressed intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) low-rank-aware quantization compatibility enhancements, and (4) optimized GPU kernels with operator fusion. Extensive experiments with popular LLMs…
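
The decompose, cache, and reconstruct steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the truncated SVD, the matrix sizes, and the names `W_k`, `A`, `B`, and `latent_cache` are assumptions chosen for illustration, and the sketch ignores RoPE, multi-head grouping, rank search, and quantization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d_model, d_head, rank, seq_len = 64, 32, 16, 10   # rank = d_head / 2

# Stand-in for a trained key-projection weight (d_model x d_head).
# Random weights have no low-rank structure, so the error printed below
# says nothing about accuracy on a real model.
W_k = rng.standard_normal((d_model, d_head))

# Step 1: decompose the linear layer, W_k ~= A @ B, via truncated SVD.
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :rank] * S[:rank]   # down-projection, d_model x rank
B = Vt[:rank, :]             # reconstruction matrix, rank x d_head

# Token hidden states entering the attention layer.
X = rng.standard_normal((seq_len, d_model))

# Step 2: cache the compressed intermediate state instead of the full keys.
latent_cache = X @ A         # seq_len x rank

# Step 3: reconstruct approximate keys on the fly when attention needs them.
K_approx = latent_cache @ B  # seq_len x d_head
K_full = X @ W_k             # what an uncompressed cache would store

cache_ratio = latent_cache.size / K_full.size
rel_err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(f"cache size ratio: {cache_ratio:.2f}")
print(f"relative reconstruction error: {rel_err:.3f}")
```

With the rank set to half the head dimension, the cached tensor holds half as many elements as the full keys, which is the kind of 50% reduction the findings refer to. The same pattern would apply to the value projection.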

Code & Models

Repositories

shadowpa0327/Palu (PyTorch, official)

Taxonomy

Topics: Embedded Systems Design Techniques · Interconnection Networks and Systems · Parallel Computing and Optimization Techniques

Methods: Softmax · Attention Is All You Need