TL;DR
HERMES introduces a training-free hierarchical KV cache architecture that enables real-time, resource-efficient streaming video understanding with significant speed and accuracy improvements.
Contribution
It proposes a novel hierarchical KV cache framework for streaming video understanding that requires no additional training and enhances efficiency and accuracy.
Findings
Achieves 10× faster time to first token (TTFT) than prior state-of-the-art (SOTA) methods.
Maintains superior or comparable accuracy with up to 68% reduction in video tokens.
Delivers up to 11.4% accuracy gains on streaming video benchmarks.
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby…
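To make the idea of a compact, multi-granularity cache more concrete, here is a minimal toy sketch. It is not the HERMES algorithm: the class name `TieredKVCache`, the two-tier split, the mean-pooling step, and the budget parameters are all hypothetical choices made for illustration. It only shows how a cache can keep recent tokens at full granularity, fold older tokens into coarser summaries, and stay within a fixed size as the stream grows.

```python
from collections import deque
from typing import Deque, List, Tuple

Vector = List[float]
KV = Tuple[Vector, Vector]  # one (key, value) pair per cached token


def mean_pool(entries: List[KV]) -> KV:
    """Average a group of (key, value) pairs into a single coarse pair."""
    n = len(entries)
    key = [sum(e[0][i] for e in entries) / n for i in range(len(entries[0][0]))]
    value = [sum(e[1][i] for e in entries) / n for i in range(len(entries[0][1]))]
    return key, value


class TieredKVCache:
    """Hypothetical two-tier cache: fine-grained recent tokens plus pooled history.

    Illustrative assumption only; the paper's actual hierarchy is not specified here.
    """

    def __init__(self, recent_budget: int, history_budget: int, pool_size: int):
        self.recent: Deque[KV] = deque()
        # maxlen makes the oldest summaries fall off once the history tier is full.
        self.history: Deque[KV] = deque(maxlen=history_budget)
        self.recent_budget = recent_budget
        self.pool_size = pool_size

    def append(self, key: Vector, value: Vector) -> None:
        self.recent.append((key, value))
        # On overflow, compress the oldest `pool_size` recent tokens into one summary.
        while len(self.recent) > self.recent_budget:
            count = min(self.pool_size, len(self.recent))
            group = [self.recent.popleft() for _ in range(count)]
            self.history.append(mean_pool(group))

    def entries(self) -> List[KV]:
        """Return cached pairs in temporal order: coarse history, then recent tokens."""
        return list(self.history) + list(self.recent)

    def __len__(self) -> int:
        return len(self.recent) + len(self.history)


# Stream 12 one-dimensional "tokens" through a cache that holds at most 4 + 3 entries.
cache = TieredKVCache(recent_budget=4, history_budget=3, pool_size=2)
for t in range(12):
    cache.append([float(t)], [float(t)])
```

In this sketch the cache never holds more than `recent_budget + history_budget` entries regardless of stream length, and a query can attend over `cache.entries()` directly without recomputing anything when it arrives. That mirrors the reuse property described in the abstract, but only at the level of a toy.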
