InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

Minsoo Kim; Kyuhong Shim; Jungwook Choi; Simyung Chang

arXiv:2506.15745 · eess.IV · October 27, 2025



TL;DR

InfiniPot-V is a training-free, query-agnostic method for compressing the KV cache of streaming video models. It cuts peak GPU memory by up to 94% while maintaining accuracy, enabling real-time on-device video understanding.

Contribution

InfiniPot-V is presented as the first length-independent, training-free KV cache compression framework for streaming video models. It requires no prior knowledge of the full video or of the user queries.

Findings

01

Reduces peak GPU memory by up to 94%

02

Maintains or improves accuracy compared to full-cache models

03

Supports real-time streaming video understanding on edge devices

Abstract

Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and streaming-video…
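The threshold-triggered compression pass described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define TaR or VaN precisely, so the sketch assumes that temporal redundancy is the cosine similarity between a cached key and the newest frame's key at the same spatial position, and that semantic significance is the L2 norm of the value vector. The class name StreamingKVCache and the parameters cap, budget, and tar_ratio are hypothetical, and the cache models a single attention head with a fixed number of patches per frame.

```python
import numpy as np


def cosine(a, b):
    """Row-wise cosine similarity between two (n, dim) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return num / den


class StreamingKVCache:
    """Toy single-head KV cache with a length-independent cap (illustrative)."""

    def __init__(self, dim, cap, budget, tar_ratio=0.5):
        assert budget <= cap
        self.cap = cap              # compression triggers above this many tokens
        self.budget = budget        # number of tokens left after compression
        self.tar_ratio = tar_ratio  # share of kept old tokens chosen by TaR
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))
        self.pos = np.empty(0, dtype=int)  # spatial position of each token

    def append_frame(self, k, v):
        """Add one frame of keys/values, shape (patches, dim)."""
        self.keys = np.concatenate([self.keys, k])
        self.values = np.concatenate([self.values, v])
        self.pos = np.concatenate([self.pos, np.arange(len(k))])
        if len(self.keys) > self.cap:
            self._compress(n_recent=len(k))

    def _compress(self, n_recent):
        n_old = len(self.keys) - n_recent
        n_keep_old = max(self.budget - n_recent, 0)  # newest frame always stays
        old_k, old_v = self.keys[:n_old], self.values[:n_old]

        # Assumed TaR-style score: similarity to the newest key at the same
        # spatial position. Low similarity means less redundant, so keep it.
        ref = self.keys[n_old:][self.pos[:n_old]]
        redundancy = cosine(old_k, ref)
        n_tar = int(n_keep_old * self.tar_ratio)
        tar_idx = np.argsort(redundancy, kind="stable")[:n_tar]

        # Assumed VaN-style score: value-vector L2 norm, highest first,
        # skipping tokens already selected by the TaR-style step.
        norms = np.linalg.norm(old_v, axis=-1)
        norms[tar_idx] = -np.inf
        van_idx = np.argsort(-norms, kind="stable")[: n_keep_old - n_tar]

        keep_old = np.sort(np.concatenate([tar_idx, van_idx]))
        keep = np.concatenate([keep_old, np.arange(n_old, n_old + n_recent)])
        self.keys = self.keys[keep]
        self.values = self.values[keep]
        self.pos = self.pos[keep]
```

Under these assumptions, the number of tokens retained after each append_frame call stays at or below cap no matter how many frames are streamed, which is the length-independent property the abstract emphasizes. The even split between the two selection criteria is an arbitrary choice for the sketch.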


Videos

InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition