HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

Zeyu Zhang; Haiying Shen; Shay Vargaftik; Ran Ben Basat; Michael Mitzenmacher; Minlan Yu

arXiv:2502.03589·cs.DC·February 7, 2025

Open Access

TL;DR

This paper introduces HACK, a method that accelerates disaggregated LLM inference by directly computing on quantized Key-Value data, significantly reducing job completion time without heavy dequantization overhead.

Contribution

HACK enables direct computation on quantized KV data, eliminating dequantization and improving inference speed for disaggregated LLMs.

Findings

01

Reduces Job Completion Time (JCT) by up to 70.9% compared to the disaggregated LLM inference baseline

02

Reduces JCT by up to 52.3% compared to existing KV quantization methods

03

Effective on long prompts and sequences

Abstract

Disaggregated Large Language Model (LLM) inference has gained popularity as it separates the computation-intensive prefill stage from the memory-intensive decode stage, avoiding the prefill-decode interference and improving resource utilization. However, transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. Additionally, the computation time overhead for prefill and decode is key for optimizing Job Completion Time (JCT), and KV data size can become prohibitive for long prompts and sequences. Existing KV quantization methods can alleviate the transmission bottleneck and reduce memory requirements, but they introduce significant dequantization overhead, exacerbating the computation time. We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference. HACK eliminates the heavy KV…
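The idea of computing on quantized data without dequantizing it first can be illustrated with a small sketch. This is not HACK's actual kernel, whose details are not given in this summary; it only shows one standard algebraic identity, assuming asymmetric uniform quantization where each value is approximated as `scale * code + lo`. The function names (`quantize`, `dequantize`, `dot_on_codes`), the 4-bit width, and the sample vectors are all hypothetical choices made for illustration.

```python
def quantize(xs, bits=4):
    """Asymmetric uniform quantization (assumed scheme): x ~= scale * code + lo."""
    lo, hi = min(xs), max(xs)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in xs]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Reconstruct approximate floats from integer codes."""
    return [scale * c + lo for c in codes]


def dot_on_codes(qa, qb):
    """Dot product computed from the integer codes plus a few scalar sums.

    Expanding sum((sa*a_i + la) * (sb*b_i + lb)) gives four terms, so the
    vectors never have to be reconstructed element by element.
    """
    (a, sa, la), (b, sb, lb) = qa, qb
    n = len(a)
    int_dot = sum(x * y for x, y in zip(a, b))  # integer-only inner product
    return sa * sb * int_dot + sa * lb * sum(a) + la * sb * sum(b) + n * la * lb


# Hypothetical query and key vectors.
query = [0.5, -1.0, 2.0, 0.25]
key = [1.5, 0.0, -0.5, 3.0]

qq, qk = quantize(query), quantize(key)

# Baseline path: dequantize both vectors, then take the dot product.
reference = sum(x * y for x, y in zip(dequantize(*qq), dequantize(*qk)))

# Code-domain path: same value, up to floating-point rounding.
fast = dot_on_codes(qq, qk)
```

In this sketch both paths return the same number up to floating-point rounding, but the code-domain path spends its main work on an integer inner product and a handful of scalar corrections. The quantization error relative to the original unquantized vectors is the same in both paths.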

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Natural Language Processing Techniques