CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

Ailar Mahdizadeh; Puria Azadi; Muchen Li; Xiangteng He; Leonid Sigal

arXiv:2605.14310 · cs.CV · May 15, 2026


TL;DR

This paper introduces a coreset-based approach for compressing visual memory in streaming video understanding, outperforming heuristic methods by selecting representative, diverse visual tokens to support future reasoning.

Contribution

The paper formulates KV-cache compression as a coreset selection problem, combining a bicriteria objective that balances coverage in key and value spaces with a diversity criterion, to improve memory efficiency in streaming video models.

Findings

01. Outperforms heuristic baselines on multiple benchmarks

02. Improves memory efficiency with a small cache budget

03. Enhances diversity and representativeness of retained tokens

Abstract

Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more…
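To make the coreset view concrete, the following is a minimal sketch, not the paper's algorithm: it uses greedy farthest-point (k-center) selection as a stand-in for covering the geometry of the cache, with a joint distance that mixes key-space and value-space distances. The function name `select_coreset`, the mixing weight `alpha`, and the choice to seed with the most recent token are all assumptions made for illustration.

```python
import numpy as np

def select_coreset(keys, values, budget, alpha=0.5):
    """Greedy farthest-point (k-center) selection over a joint key/value distance.

    keys:   (n, d_k) array of cached key vectors
    values: (n, d_v) array of cached value vectors
    budget: number of tokens to retain
    alpha:  hypothetical weight trading key-space against value-space coverage
    Returns the sorted indices of the retained tokens.
    """
    keys = np.asarray(keys, dtype=float)
    values = np.asarray(values, dtype=float)
    n = keys.shape[0]
    if values.shape[0] != n:
        raise ValueError("keys and values must have the same number of tokens")
    budget = min(budget, n)
    if budget <= 0:
        return np.array([], dtype=int)

    def joint_dist(i):
        # Distance from token i to every token, mixed across both spaces.
        dk = np.linalg.norm(keys - keys[i], axis=1)
        dv = np.linalg.norm(values - values[i], axis=1)
        return alpha * dk + (1.0 - alpha) * dv

    # Assumption: seed with the most recent token.
    selected = [n - 1]
    min_dist = joint_dist(n - 1)   # distance of each token to its nearest selected token
    min_dist[n - 1] = -1.0         # mark as taken so it is never picked again

    while len(selected) < budget:
        nxt = int(np.argmax(min_dist))  # least-covered remaining token
        selected.append(nxt)
        min_dist = np.minimum(min_dist, joint_dist(nxt))
        min_dist[nxt] = -1.0

    return np.sort(np.array(selected))
```

Unlike per-token scoring by recency or saliency, each pick here depends on what has already been retained, so near-duplicate tokens are unlikely to be kept together while any region of the cache remains uncovered. Setting `alpha` to 1 or 0 reduces the sketch to key-only or value-only coverage.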
