Mosaic: Cross-Modal Clustering for Efficient Video Understanding
Tuowei Wang, He Zhou, Chengru Song, Qiushi Li, Ju Ren

TL;DR
Mosaic introduces a cluster-driven approach to manage KVCache in streaming long-video understanding, improving efficiency and reducing overhead in vision-language models.
Contribution
Mosaic is presented as the first cluster-driven VLM inference system for streaming long-video understanding, using cross-modal clustering to manage KVCache.
Findings
Achieves up to 1.38x speedup over baselines.
Utilizes cross-modal clusters to organize KVCache efficiently.
Outperforms existing retrieval-based approaches in long-video understanding.
Abstract
Large vision-language models (VLMs) are enabling interactive video reasoning, giving rise to streaming long-video understanding. In this setting, frames arrive continuously, while the system preserves long-term context and generates responses under strict latency constraints. A central challenge is KVCache management: as video streams grow, KVCache expands rapidly, increasing computation and memory overhead. Existing retrieval-based approaches exploit attention sparsity and offload inactive KVCache from GPU to CPU memory, but their token-level design causes high management overhead and fragmented data movement. We present Mosaic, the first cluster-driven VLM inference system for streaming long-video understanding. Our key insight is that VLM KVCache exhibits an implicit cross-modal clustering structure: retrieved KV states form groups jointly shaped by visual coherence and semantic…
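The contrast the abstract draws between token-level retrieval and cluster-level retrieval can be illustrated with a small sketch. This is not Mosaic's algorithm: the plain k-means step stands in for the paper's cross-modal clustering, whose details are not given here, and every function name, parameter, and size below is hypothetical. The sketch only shows why scoring a few centroids and fetching whole groups involves less bookkeeping than scoring and fetching individual tokens.

```python
import numpy as np

def assign(keys, centroids):
    """Return the index of the nearest centroid for each key vector."""
    dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def kmeans(keys, num_clusters, iters=10, seed=0):
    """Plain k-means over key vectors (an assumed stand-in, not the paper's method)."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), num_clusters, replace=False)].copy()
    for _ in range(iters):
        labels = assign(keys, centroids)
        for c in range(num_clusters):
            members = keys[labels == c]
            if len(members):  # keep the old centroid if a cluster empties out
                centroids[c] = members.mean(axis=0)
    return centroids, assign(keys, centroids)

def retrieve_token_level(query, keys, top_k):
    """Score every cached key and return the indices of the top_k tokens."""
    scores = keys @ query
    return np.sort(np.argsort(scores)[-top_k:])

def retrieve_cluster_level(query, centroids, labels, top_c):
    """Score only the centroids, then fetch every token in the chosen clusters."""
    scores = centroids @ query
    chosen = np.argsort(scores)[-top_c:]
    return np.sort(np.flatnonzero(np.isin(labels, chosen)))

# Toy cache: 256 key vectors of dimension 16 (sizes are arbitrary).
rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 16)).astype(np.float32)
query = rng.normal(size=16).astype(np.float32)

centroids, labels = kmeans(keys, num_clusters=8)
token_ids = retrieve_token_level(query, keys, top_k=32)
cluster_ids = retrieve_cluster_level(query, centroids, labels, top_c=2)
print("token-level scored", len(keys), "keys; cluster-level scored", len(centroids), "centroids")
```

In the token-level path, the retrieval step scores all 256 keys and returns scattered indices. In the cluster-level path, it scores 8 centroids and returns whole groups, which is the kind of coarser-grained management and data movement the abstract attributes to a cluster-driven design.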
