AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization

Zhonghua Jiang; Kui Chen; Kunxi Li; Keting Yin; Yiyun Zhou; Zhaode Wang; Chengfei Lv; Shengyu Zhang

arXiv:2511.11106 · cs.MM · November 17, 2025

Open Access

TL;DR

AccKV introduces an adaptive focusing and cross-calibration framework to optimize key-value cache management in audio-video LLMs, significantly enhancing inference efficiency without sacrificing accuracy.

Contribution

This paper proposes AccKV, a framework that adaptively focuses on key modalities and calibrates cross-modal KV caches to improve AV-LLM inference efficiency.

Findings

01 Significant reduction in computational cost during AV-LLM inference.

02 Accuracy is maintained or improved under the optimized cache management.

03 Layer-specific focusing improves the identification of important tokens.

Abstract

Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialog systems. Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache than static image embeddings require. A naive optimization strategy is to selectively focus on and retain the KV caches of audio or video based on the task. However, in our experiments, we observed that the attention of AV-LLMs to the various modalities in the higher layers is not strictly dependent on the task: in higher layers, attention shifts more towards the video modality. We also found that directly integrating the temporal KV of audio and the spatio-temporal KV of video may lead to information confusion and significant performance degradation of AV-LLMs. If audio and video are processed…
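
As a rough illustration of two ideas mentioned in the abstract (measuring how much attention each modality receives, and the naive strategy of retaining only the most-attended KV entries), the following minimal Python sketch may help. It is not the AccKV algorithm: the function names `modality_attention_share` and `keep_top_kv`, the token layout, and the attention weights are all hypothetical and chosen only for demonstration.

```python
from typing import Dict, List, Sequence


def modality_attention_share(
    attn_row: Sequence[float], modality_of: Sequence[str]
) -> Dict[str, float]:
    """Fraction of one query's attention mass that lands on each modality."""
    totals: Dict[str, float] = {}
    for weight, modality in zip(attn_row, modality_of):
        totals[modality] = totals.get(modality, 0.0) + weight
    mass = sum(totals.values())
    return {m: w / mass for m, w in totals.items()}


def keep_top_kv(attn_row: Sequence[float], budget: int) -> List[int]:
    """Indices of the `budget` most-attended cached tokens, in original order."""
    ranked = sorted(range(len(attn_row)), key=lambda i: attn_row[i], reverse=True)
    return sorted(ranked[:budget])


# Toy cache of six tokens; the weights are invented for illustration only.
modality_of = ["audio", "audio", "video", "video", "video", "text"]
attn_row = [0.05, 0.10, 0.30, 0.25, 0.20, 0.10]

share = modality_attention_share(attn_row, modality_of)
kept = keep_top_kv(attn_row, budget=3)

print(share)  # per-modality share of attention for this query
print(kept)   # positions of the KV entries that would be retained
```

In this toy layer most of the attention mass falls on video tokens, so a plain top-k retention keeps only video entries and discards all audio. That is the kind of modality imbalance the abstract describes; how AccKV itself selects and calibrates entries is not recoverable from the truncated abstract above.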

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling