Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models

Vishnu Sai, Dheeraj Sai, Srinath B, Girish Varma, and Priyesh Shukla

arXiv:2602.14236 · cs.CV · February 17, 2026

TL;DR

Sali-Cache is a proactive memory management framework for vision-language models that uses dual-signal filters to optimize KV-cache usage, enabling efficient long-form video understanding without accuracy loss.

Contribution

The paper introduces a novel a priori caching strategy that combines a temporal filter and a spatial filter, reducing memory usage while maintaining model accuracy on long videos.

Findings

01. Achieves a 2.20x memory compression ratio.

02. Maintains 100% accuracy on BLEU, ROUGE-L, and Exact Match metrics.

03. Enables long-form video processing on consumer hardware.
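
To put the 2.20x figure in context, the following is a back-of-the-envelope sketch of how KV-cache memory scales with token count. The layer count, head count, head dimension, and fp16 storage are illustrative assumptions for a 7B-class decoder, not values taken from the paper, and `kv_cache_bytes` and `compressed_bytes` are hypothetical helper names.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    All defaults are assumptions (a 7B-class decoder in fp16),
    not the configuration reported in the paper.
    """
    # Factor 2 covers one key tensor plus one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


def compressed_bytes(baseline_bytes, compression_ratio=2.20):
    """Effective memory after applying a given compression ratio."""
    return baseline_bytes / compression_ratio


if __name__ == "__main__":
    tokens = 100_000  # hypothetical token count for a long video
    baseline = kv_cache_bytes(tokens)
    print(f"baseline:   {baseline / 2**30:.1f} GiB")
    print(f"compressed: {compressed_bytes(baseline) / 2**30:.1f} GiB")
```

Under these assumptions the cache grows linearly with the number of cached tokens, and a 2.20x ratio leaves roughly 45% of the baseline footprint.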

Abstract

Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method…
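
The dual-signal idea described in the abstract can be illustrated with a minimal NumPy sketch that decides, before any attention is computed, which image patches are worth caching. This is not the authors' implementation: absolute frame difference stands in for optical flow, gradient magnitude stands in for a saliency detector, and combining the two signals with a logical AND is an assumption. The function names, the 14-pixel patch size, and the thresholds are all hypothetical.

```python
import numpy as np


def patch_means(x, patch):
    """Average a 2-D map over non-overlapping patch x patch blocks."""
    h, w = x.shape
    gh, gw = h // patch, w // patch
    x = x[:gh * patch, :gw * patch]  # drop any ragged border
    return x.reshape(gh, patch, gw, patch).mean(axis=(1, 3))


def temporal_signal(prev_frame, frame, patch):
    """Per-patch motion proxy: mean absolute frame difference.

    Assumption: a cheap stand-in for optical-flow magnitude.
    """
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    return patch_means(diff, patch)


def spatial_signal(frame, patch):
    """Per-patch saliency proxy: mean gradient magnitude.

    Assumption: a cheap stand-in for a saliency detector.
    """
    gy, gx = np.gradient(frame.astype(np.float32))
    return patch_means(np.hypot(gx, gy), patch)


def select_patches(prev_frame, frame, patch=14,
                   motion_thresh=4.0, saliency_thresh=4.0):
    """Boolean grid: True where a patch's tokens would be kept.

    Assumed rule: keep a patch only if it changed since the previous
    frame (not temporally redundant) AND is visually salient.
    """
    moving = temporal_signal(prev_frame, frame, patch) > motion_thresh
    salient = spatial_signal(frame, patch) > saliency_thresh
    return moving & salient


if __name__ == "__main__":
    prev = np.zeros((28, 28), dtype=np.uint8)
    cur = prev.copy()
    cur[4:10, 4:10] = 255  # a bright square appears in the top-left patch
    mask = select_patches(prev, cur)
    print(mask)
    print("fraction of patches kept:", mask.mean())
```

Because the decision uses only pixel-level signals, it can run before the tokens reach the attention layers, which is the property the abstract contrasts with reactive eviction.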

Taxonomy

Topics: Visual Attention and Saliency Detection · Caching and Content Delivery · Multimodal Machine Learning Applications