Mosaic: Cross-Modal Clustering for Efficient Video Understanding

Tuowei Wang; He Zhou; Chengru Song; Qiushi Li; Ju Ren

arXiv:2604.10060 · cs.PF · April 14, 2026

TL;DR

Mosaic introduces a cluster-driven approach to manage KVCache in streaming long-video understanding, improving efficiency and reducing overhead in vision-language models.

Contribution

Mosaic is presented as the first system to use cross-modal clustering for KVCache management, with the goal of faster streaming video inference.

Findings

01 Achieves up to 1.38x speedup over baselines.

02 Utilizes cross-modal clusters to organize KVCache efficiently.

03 Outperforms existing retrieval-based approaches in long-video understanding.

Abstract

Large vision-language models (VLMs) are enabling interactive video reasoning, giving rise to streaming long-video understanding. In this setting, frames arrive continuously, while the system preserves long-term context and generates responses under strict latency constraints. A central challenge is KVCache management: as video streams grow, KVCache expands rapidly, increasing computation and memory overhead. Existing retrieval-based approaches exploit attention sparsity and offload inactive KVCache from GPU to CPU memory, but their token-level design causes high management overhead and fragmented data movement. We present Mosaic, the first cluster-driven VLM inference system for streaming long-video understanding. Our key insight is that VLM KVCache exhibits an implicit cross-modal clustering structure: retrieved KV states form groups jointly shaped by visual coherence and semantic…
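
To make the contrast between token-level and cluster-level retrieval concrete, here is a minimal Python sketch. It is not the authors' implementation: it substitutes plain k-means over key vectors for Mosaic's cross-modal clustering, and the names `kmeans`, `retrieve_clusters`, `num_clusters`, and `top_c` are hypothetical. The idea shown is only that a query is scored against a few cluster centroids and whole clusters are fetched together, rather than scoring and moving individual tokens.

```python
import numpy as np

def kmeans(keys, num_clusters, iters=10, seed=0):
    """Plain k-means over key vectors.

    Assumption: a stand-in for illustration, not Mosaic's cross-modal clustering.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct keys.
    init = rng.choice(len(keys), size=num_clusters, replace=False)
    centroids = keys[init].copy()

    def assign(cents):
        # Nearest centroid per key under squared Euclidean distance.
        dists = ((keys[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centroids)
        for c in range(num_clusters):
            members = keys[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids, assign(centroids)

def retrieve_clusters(query, centroids, labels, top_c):
    """Score clusters by query-centroid dot product and return the
    ids of the top_c clusters plus the indices of all their tokens."""
    scores = centroids @ query
    chosen = np.argsort(scores)[::-1][:top_c]
    token_idx = np.flatnonzero(np.isin(labels, chosen))
    return chosen, token_idx

# Hypothetical sizes: 256 cached key vectors of dimension 16.
rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 16)).astype(np.float32)
query = rng.standard_normal(16).astype(np.float32)

centroids, labels = kmeans(keys, num_clusters=8)
chosen, token_idx = retrieve_clusters(query, centroids, labels, top_c=2)
print("clusters fetched:", chosen, "tokens fetched:", len(token_idx))
```

In this sketch the per-query scoring cost scales with the number of clusters rather than the number of cached tokens, and each selected cluster could be moved between CPU and GPU memory as one unit. Whether Mosaic scores or transfers clusters this way is not stated in the truncated abstract above.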
