LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Zhenyu Ning; Guangda Liu; Qihao Jin; Chengwei Li; Wenchao Ding; Minyi Guo; Jieru Zhao

arXiv:2505.15269 · cs.CV · April 24, 2026

TL;DR

LiveVLM is a training-free framework for real-time online video understanding. It reduces the memory overhead and response delay of Video LLMs through streaming-oriented KV cache compression and retrieval.

Contribution

The paper proposes LiveVLM, a training-free framework that combines Vision Sink Bucketing with Position-agnostic KV Retrieval for efficient, online, query-agnostic video processing.

Findings

01

Achieves state-of-the-art accuracy among training-free query-agnostic methods.

02

Reduces memory overhead and response delay in online video understanding.

03

Supports real-time interaction in applications like autonomous driving and robotics.

Abstract

Recent developments in Video Large Language Models (Video LLMs) have enabled models to process hour-long videos and exhibit exceptional performance. Nonetheless, the Key-Value (KV) cache expands linearly over time, leading to substantial memory overhead and response delay, both critical challenges in various real-world online applications, such as Deepseek services, autonomous driving and robotics. To mitigate these issues, we propose LiveVLM, a training-free and query-agnostic framework specifically designed for online video understanding and real-time interaction. LiveVLM employs a Vision Sink Bucketing (VSB) mechanism to process video streams in real time, retain long-term video details and eliminate redundant KVs. This mechanism utilizes vision-to-vision attention scores as the metric and seeks to maximize the coverage of contextual information during compression. Noting that…
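
The abstract says only that VSB uses vision-to-vision attention scores as its metric and aims to maximize coverage of contextual information. It does not specify how buckets are formed or how scores are aggregated. The sketch below is therefore an illustration of the general idea, not the authors' algorithm. It assumes buckets are contiguous runs of vision tokens and that a token's score is the total attention it receives. The function name `bucketed_keep_indices` and its parameters are hypothetical.

```python
from typing import List


def bucketed_keep_indices(
    attn: List[List[float]], bucket_size: int, keep_per_bucket: int
) -> List[int]:
    """Pick which vision tokens to keep, bucket by bucket.

    attn[q][k] is the attention weight from vision token q to vision token k
    (a hypothetical input; a real model exposes this per layer and head).
    """
    num_tokens = len(attn)
    # Assumed metric: score each key token by the attention it receives
    # summed over all vision queries.
    received = [sum(row[k] for row in attn) for k in range(num_tokens)]

    kept: List[int] = []
    for start in range(0, num_tokens, bucket_size):
        bucket = range(start, min(start + bucket_size, num_tokens))
        # Select the top tokens within this bucket only, so every temporal
        # segment keeps some representation after compression.
        top = sorted(bucket, key=lambda k: received[k], reverse=True)
        kept.extend(sorted(top[:keep_per_bucket]))
    return kept


# Toy example: 4 vision tokens, buckets of 2, keep 1 token per bucket.
toy_attn = [
    [0.6, 0.3, 0.0, 0.1],
    [0.5, 0.4, 0.0, 0.1],
    [0.5, 0.2, 0.1, 0.2],
    [0.4, 0.2, 0.1, 0.3],
]
kept = bucketed_keep_indices(toy_attn, bucket_size=2, keep_per_bucket=1)
print(kept)  # [0, 3]
```

In this toy input, a global top-2 selection would keep tokens 0 and 1 and drop the second bucket entirely. Selecting per bucket keeps token 3 instead, which is one simple way to trade peak attention score for broader coverage.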
