CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
Ailar Mahdizadeh, Puria Azadi, Muchen Li, Xiangteng He, Leonid Sigal

TL;DR
CoRDS compresses the visual memory (KV cache) of streaming video models by selecting a coreset of representative, diverse visual tokens to support future reasoning, and it outperforms heuristic token-scoring methods.
Contribution
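The paper formulates KV-cache compression as a coreset selection problem over a joint key-value representation, using a bicriteria objective that balances coverage in key and value spaces together with a diversity criterion, to improve memory efficiency in streaming video models.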
Findings
Outperforms heuristic baselines on multiple benchmarks
Improves memory efficiency with a small cache budget
Enhances diversity and representativeness of retained tokens
Abstract
Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more…
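To make the coreset view concrete, the following is a minimal sketch of covering-based token selection. It is not the paper's algorithm: it uses generic greedy farthest-point (k-center) selection over a blended key/value distance. The function name `select_coreset`, the weight `alpha`, and the per-row normalisation are assumptions introduced here for illustration only.

```python
import numpy as np

def select_coreset(keys, values, budget, alpha=0.5):
    """Greedy farthest-point (k-center) selection over a blended key/value distance.

    keys:   array of shape (n_tokens, dim_k)
    values: array of shape (n_tokens, dim_v)
    budget: number of tokens to retain
    alpha:  hypothetical weight between key-space and value-space coverage
    Returns the sorted indices of the retained tokens.
    """
    n = keys.shape[0]
    if budget <= 0:
        return np.array([], dtype=int)
    if budget >= n:
        return np.arange(n)

    # Normalise each space so neither dominates purely because of scale.
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    v = values / (np.linalg.norm(values, axis=1, keepdims=True) + 1e-8)

    def dist_to(i):
        # Blended distance from every token to token i.
        dk = np.linalg.norm(k - k[i], axis=1)
        dv = np.linalg.norm(v - v[i], axis=1)
        return alpha * dk + (1.0 - alpha) * dv

    selected = [0]            # arbitrary seed token (an assumption of this sketch)
    min_dist = dist_to(0)     # distance of each token to its nearest selected token
    min_dist[0] = -1.0        # mark as selected so it is never picked again
    for _ in range(budget - 1):
        nxt = int(np.argmax(min_dist))   # token that is currently worst covered
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist_to(nxt))
        min_dist[nxt] = -1.0
    return np.sort(np.array(selected))

# Usage: keep 32 of 200 cached tokens.
rng = np.random.default_rng(0)
keys = rng.normal(size=(200, 16))
values = rng.normal(size=(200, 16))
kept = select_coreset(keys, values, budget=32)
```

Unlike per-token scoring by recency or saliency, each pick here depends on what has already been retained, which is what lets the subset cover the geometry of the cache rather than concentrate on a few high-scoring regions.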
