FDC: Fast KV Dimensionality Compression for Efficient LLM Inference

Zeyu Zhang; Haiying Shen

arXiv:2408.04107·cs.LG·June 10, 2025

TL;DR

FDC is a system that accelerates large language model inference by compressing the dimensionality of the key-value (KV) cache without incurring decompression overhead, adapting compression rates across heads and layers, and balancing the resulting uneven attention workloads to reduce latency and improve throughput.

Contribution

FDC introduces a fast, adaptive KV dimensionality compression method that eliminates the decompression overhead of the prior system Palu and improves inference efficiency in large language models.

Findings

01

Reduces Job Completion Time (JCT) by up to 64% compared to Palu

02

Achieves up to 1.97X the throughput of Palu at the same latency

03

Maintains 99% of the accuracy achieved without compression

Abstract

In large language models, memory constraints in the Key-Value Cache (KVC) pose a challenge during inference. In this work, we propose FDC, a fast KV dimensionality compression system that eliminates the decompression overhead incurred in the existing KV dimensionality compression system, Palu, and reduces attention time. Moreover, FDC employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference, to maximize overall compression while satisfying an accuracy loss constraint. Additionally, FDC enhances the attention kernel to balance the uneven workloads caused by the adaptive compression approach, further reducing attention computation latency. Comprehensive experiments demonstrate that, compared to Palu, FDC can reduce Job Completion Time (JCT) by up to 64% and deliver up to 1.97X throughput under the same latency, while…
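
As a rough illustration of why a dimensionality-compressed KV cache does not necessarily have to be decompressed before attention, the sketch below compares two ways of computing attention scores against keys stored in a low-rank latent form. It is a generic linear-algebra example, not FDC's or Palu's actual kernel: the names `k_latent`, `up_proj`, and both functions are hypothetical, the shapes are arbitrary, and positional encodings (which complicate this kind of reordering in practice) are ignored.

```python
import numpy as np

def attention_scores_decompress(q, k_latent, up_proj):
    """Baseline: reconstruct full-dimension keys, then score them against the query."""
    k_full = k_latent @ up_proj      # (n_tokens, d_head): the decompression step
    return k_full @ q                # (n_tokens,)

def attention_scores_latent(q, k_latent, up_proj):
    """Decompression-free: project the query into the latent space once."""
    q_latent = up_proj @ q           # (rank,)
    return k_latent @ q_latent       # (n_tokens,)

# Hypothetical sizes chosen only for the illustration.
rng = np.random.default_rng(0)
n_tokens, d_head, rank = 128, 64, 16
k_latent = rng.standard_normal((n_tokens, rank))   # compressed keys kept in the cache
up_proj = rng.standard_normal((rank, d_head))      # maps latent keys back to d_head
q = rng.standard_normal(d_head)

baseline = attention_scores_decompress(q, k_latent, up_proj)
latent = attention_scores_latent(q, k_latent, up_proj)
scores_match = np.allclose(baseline, latent)
```

Both paths give the same scores because matrix multiplication is associative, but the second never materializes the `(n_tokens, d_head)` key matrix: its per-query cost scales with `rank` rather than `d_head` for each cached token.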

Taxonomy

Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Algorithms and Data Compression

Methods: Softmax · Attention Is All You Need