HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Haowei Zhang; Shudong Yang; Jinlan Fu; See-Kiong Ng; Xipeng Qiu

arXiv:2601.14724 · cs.CV · May 8, 2026

TL;DR

HERMES is a training-free architecture that treats the KV cache as a hierarchical memory, enabling real-time streaming video understanding under tight GPU memory budgets, with up to 10× faster TTFT than prior SOTA methods and up to 11.4% accuracy gains on streaming video benchmarks.

Contribution

The paper frames the KV cache as a hierarchical memory that holds video information at multiple granularities, and builds on that view a streaming video understanding framework that needs no additional training and improves both efficiency and accuracy.

Findings

01 Achieves 10× faster TTFT than prior SOTA methods.

02 Maintains superior or comparable accuracy with up to 68% reduction in video tokens.

03 Delivers up to 11.4% accuracy gains on streaming video benchmarks.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby…
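
To make the idea of reusing a compact KV cache more concrete, the following is a minimal sketch of a generic fixed-budget streaming cache. It is not the HERMES algorithm: the abstract does not specify how the hierarchy is built, so the `BoundedStreamCache` class, the per-entry `score` (standing in for some importance signal such as accumulated attention), and the "keep recent entries plus top-scoring older ones" policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    frame_idx: int   # which frame this cached entry came from
    score: float     # hypothetical importance score (assumption, not from the paper)


@dataclass
class BoundedStreamCache:
    """Toy fixed-budget cache: recent entries plus the highest-scoring older ones."""
    budget: int          # maximum number of entries kept at any time
    recent_window: int   # number of newest entries that are never evicted
    entries: List[Entry] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not (0 < self.recent_window <= self.budget):
            raise ValueError("require 0 < recent_window <= budget")

    def append(self, frame_idx: int, score: float) -> None:
        """Add one entry from the stream and evict if the budget is exceeded."""
        self.entries.append(Entry(frame_idx, score))
        if len(self.entries) > self.budget:
            self._evict()

    def _evict(self) -> None:
        # Split into a protected recent window and older candidates.
        recent = self.entries[-self.recent_window:]
        older = self.entries[:-self.recent_window]
        keep_n = self.budget - len(recent)
        # Keep the top-scoring older entries, then restore temporal order.
        kept = sorted(older, key=lambda e: e.score, reverse=True)[:keep_n]
        kept.sort(key=lambda e: e.frame_idx)
        self.entries = kept + recent


def run_demo() -> List[int]:
    """Stream six frames through a budget-4 cache and return surviving frame indices."""
    cache = BoundedStreamCache(budget=4, recent_window=2)
    for idx, score in enumerate([0.9, 0.1, 0.5, 0.2, 0.3, 0.4]):
        cache.append(idx, score)
    return [e.frame_idx for e in cache.entries]
```

Because the cache never grows past `budget`, memory stays constant however long the stream runs, and a query can be answered from whatever the cache holds at that moment without a separate retrieval pass. That is the general property the abstract points to; the actual selection rule used by HERMES may differ substantially from this one.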

Code & Models

Repositories

haowei-freesky/HERMES (GitHub)