Palu: Compressing KV-Cache with Low-Rank Projection
Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, and Kai-Chiang Wu

TL;DR
Palu is a low-rank projection framework that compresses the KV-Cache of large language models by 50% and speeds up inference while maintaining accuracy, using a medium-grained decomposition scheme, a rank search algorithm, and optimized GPU kernels.
Contribution
Palu compresses the KV-Cache along the hidden dimension through low-rank projection, exploiting redundancy that token-selection and quantization methods leave untouched.
Findings
Compresses KV-Cache by 50% while maintaining accuracy.
Achieves up to 1.89x speedup in RoPE-based attention.
Combined with quantization, achieves up to 2.91x speedup in RoPE-based attention while saving more memory than quantization-only methods.
Abstract
Post-training KV-Cache compression methods typically either sample a subset of effectual tokens or quantize the data into a lower numerical bit width. However, these methods cannot exploit redundancy in the hidden dimension of the KV tensors. This paper presents a hidden-dimension compression approach called Palu, a KV-Cache compression framework that utilizes low-rank projection to reduce inference-time LLM memory usage. Palu decomposes the linear layers into low-rank matrices, caches compressed intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) low-rank-aware quantization compatibility enhancements, and (4) optimized GPU kernels with operator fusion. Extensive experiments with popular LLMs…
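The core idea in the abstract (decompose a projection layer into two low-rank matrices, cache the compressed intermediate state, and reconstruct the full keys on the fly) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a plain truncated SVD of a single key projection, and the names `low_rank_factors`, `W_k`, `latent_cache`, and all dimensions are hypothetical. The weight is constructed to be exactly rank 16 so the reconstruction is exact; real model weights are only approximately low-rank, so some error would remain. The medium-grained decomposition, rank search, quantization enhancements, and fused GPU kernels are not shown.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Truncated SVD: W (d_in x d_out) ~= A (d_in x rank) @ B (rank x d_out)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    A = U[:, :rank] * sqrt_s            # down-projection, applied before caching
    B = sqrt_s[:, None] * Vt[:rank, :]  # up-projection, applied at attention time
    return A, B

rng = np.random.default_rng(0)
d_model, d_head, rank, n_tokens = 64, 32, 16, 10  # illustrative sizes (assumption)

# Hypothetical key projection weight, built to be exactly rank 16.
W_k = rng.standard_normal((d_model, rank)) @ rng.standard_normal((rank, d_head))
A, B = low_rank_factors(W_k, rank)

X = rng.standard_normal((n_tokens, d_model))  # token hidden states

latent_cache = X @ A      # what gets cached: (n_tokens, rank)
K_rec = latent_cache @ B  # keys reconstructed on the fly: (n_tokens, d_head)
K_full = X @ W_k          # keys an uncompressed cache would have stored

cache_ratio = latent_cache.size / K_full.size  # fraction of the original cache size
max_err = np.abs(K_full - K_rec).max()         # reconstruction error
```

With `rank` set to half of `d_head`, the cached tensor holds half as many elements as the full keys, which corresponds to a 50% compression rate in this toy setting.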
Taxonomy
Topics: Embedded Systems Design Techniques · Interconnection Networks and Systems · Parallel Computing and Optimization Techniques
Methods: Softmax · Attention Is All You Need
