FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
Zeyu Zhang, Haiying Shen

TL;DR
FDC is a system that accelerates large language model inference by compressing the dimensionality of cached key-value (KV) pairs without incurring decompression overhead, adapting compression rates across heads and layers, and balancing the resulting uneven attention workloads to reduce latency and improve throughput.
Contribution
FDC introduces a fast, adaptive KV dimensionality compression method that eliminates the decompression overhead of the prior system, Palu, and improves inference efficiency in large language models.
Findings
Reduces Job Completion Time by up to 64%
Achieves up to 1.97X throughput at the same latency
Maintains 99% of the accuracy achieved without compression
Abstract
In large-language models, memory constraints in the Key-Value Cache (KVC) pose a challenge during inference. In this work, we propose FDC, a fast KV dimensionality compression system that eliminates the decompression overhead incurred in the existing KV dimensionality compression system, Palu, and reduces attention time. Moreover, FDC employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference to maximize overall compression while maintaining an accuracy loss constraint. Additionally, FDC enhances the attention kernel to balance the uneven workloads caused by the adaptive compression approach to further reduce attention computation latency. Comprehensive experiments demonstrate that compared to Palu, FDC can reduce Job Completion Time (JCT) by up to 64%, and delivers up to 1.97X throughput under the same latency, while…
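The abstract does not say how FDC avoids decompression, so the following is only a minimal sketch of one generic way attention scores can be computed directly on dimension-compressed keys, not FDC's actual kernel. It assumes each key is stored as a low-dimensional vector `k_c` of size r that reconstructs to the full key as `k_c @ basis`, where `basis` is an r x d matrix. Under that assumption, the identity q · (k_c B) = (B q) · k_c lets the query be projected once instead of reconstructing every key. All function and variable names here are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reconstruct_key(k_c, basis):
    """Expand a compressed key (length r) to full dimension d via k_c @ basis."""
    d = len(basis[0])
    return [sum(c * row[j] for c, row in zip(k_c, basis)) for j in range(d)]


def scores_with_decompression(q, k_compressed, basis):
    """Baseline: rebuild every full-dimension key, then take q . k."""
    return [dot(q, reconstruct_key(k_c, basis)) for k_c in k_compressed]


def scores_in_compressed_space(q, k_compressed, basis):
    """Project the query once (d -> r), then score against stored compressed keys."""
    q_low = [dot(row, q) for row in basis]  # basis @ q, length r
    return [dot(q_low, k_c) for k_c in k_compressed]


# Toy example with d = 4, r = 2 (integer values so the comparison is exact).
basis = [[1, 0, 2, 0],
         [0, 1, 0, 3]]
k_compressed = [[1, 2], [3, -1]]
q = [1, 2, 3, 4]

baseline = scores_with_decompression(q, k_compressed, basis)
direct = scores_in_compressed_space(q, k_compressed, basis)
print(baseline, direct)  # [35, 7] [35, 7]
```

In this sketch the per-key cost drops from O(r·d) for reconstruction plus O(d) for the dot product to O(r), with a single O(r·d) query projection shared across all cached keys. Whether FDC uses this particular reordering is not stated in the abstract.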
Taxonomy
Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Algorithms and Data Compression
Methods: Softmax · Attention Is All You Need
