AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

Yuxuan Wang; Peize He; Xiyan Gui; Xiaoqian Liu; Junhao He; Xuyang Liu; Zichen Wen; Xuming Hu; Linfeng Zhang

arXiv:2604.06694 · cs.SD · April 9, 2026

TL;DR

AudioKV introduces a novel KV cache management framework for large audio-language models, prioritizing audio-critical attention heads and employing spectral smoothing to improve efficiency and accuracy during long-context inference.

Contribution

The paper presents a new method for KV cache eviction in LALMs that leverages semantic-acoustic alignment and spectral smoothing, outperforming existing compression techniques.

Findings

01

At 40% compression, AudioKV stays close to full-cache accuracy, losing only 0.45%.

02

It significantly outperforms traditional cache compression methods in audio-language models.

03

The approach enhances computational efficiency while preserving model performance.

Abstract

Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from…
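
The two ideas named in the abstract, head-aware budget allocation and FFT-based score smoothing, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the head-scoring rule, the allocation formula, or the filter design. The proportional allocation, the hard low-pass cutoff, and the names `allocate_head_budgets`, `spectral_smooth`, `select_tokens`, and `keep_ratio` are all assumptions made here for illustration.

```python
import numpy as np


def allocate_head_budgets(head_scores, total_budget, min_per_head=1):
    """Split a total KV-token budget across heads in proportion to head_scores.

    Assumption: head_scores are non-negative "audio-criticality" scores per head.
    The proportional rule is illustrative, not the paper's allocation formula.
    """
    scores = np.asarray(head_scores, dtype=np.float64)
    n = scores.size
    spare = total_budget - min_per_head * n
    if spare < 0:
        raise ValueError("total_budget is too small for min_per_head")
    total = scores.sum()
    weights = scores / total if total > 0 else np.full(n, 1.0 / n)
    extra = np.floor(weights * spare).astype(int)
    # Hand any rounding leftover to the highest-scoring heads first.
    leftover = int(spare - extra.sum())
    order = np.argsort(-scores, kind="stable")
    extra[order[:leftover]] += 1
    return extra + min_per_head


def spectral_smooth(scores, keep_ratio=0.25):
    """Low-pass filter a 1-D token-importance sequence in the frequency domain.

    Assumption: a hard cutoff that keeps the lowest `keep_ratio` fraction of
    rFFT bins; the actual Spectral Score Smoothing filter may differ.
    """
    scores = np.asarray(scores, dtype=np.float64)
    spectrum = np.fft.rfft(scores)
    cutoff = max(1, int(np.ceil(keep_ratio * spectrum.size)))
    spectrum[cutoff:] = 0.0  # zero out the high-frequency components
    return np.fft.irfft(spectrum, n=scores.size)


def select_tokens(scores, budget):
    """Return the sorted indices of the `budget` highest-scoring tokens."""
    scores = np.asarray(scores)
    budget = max(0, min(int(budget), scores.size))
    keep = np.argsort(-scores, kind="stable")[:budget]
    return np.sort(keep)


# Toy example: a slow trend over 64 audio tokens plus alternating-sign noise.
t = np.arange(64)
clean = np.sin(2 * np.pi * t / 64)
noisy = clean + 0.5 * (-1.0) ** t

smoothed = spectral_smooth(noisy, keep_ratio=0.25)
budgets = allocate_head_budgets([3.0, 1.0], total_budget=32)
kept = select_tokens(smoothed, budgets[0])
```

In this toy setup the alternating noise sits at the highest frequency bin, so the low-pass step recovers the slow trend and the retained tokens form a contiguous region around its peak. That is the kind of temporal continuity the abstract says generic eviction scores tend to overlook.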
