DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
Jinyu Guo, Zhihan Zhang, Jiehui Xie, Md. Tamim Iqbal, Dongshen Han, Lik-Hang Lee, Sung-Ho Bae, Jie Zou, Yang Yang, Chaoning Zhang

TL;DR
DASH-KV is a novel framework that accelerates long-context large language model inference by reformulating attention as approximate nearest-neighbor search with asymmetric deep hashing, reducing complexity from quadratic to linear.
Contribution
It introduces an asymmetric encoding architecture and a dynamic mixed-precision mechanism to improve efficiency without sacrificing accuracy in long-context inference.
Findings
DASH-KV significantly outperforms baseline methods on LongBench.
It reduces inference complexity from O(N^2) to O(N).
It matches full attention performance while being more efficient.
Abstract
The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
