StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Xueyi Chen; Keda Tao; Kele Shao; Huan Wang

arXiv:2510.18269 · cs.CV · March 17, 2026

TL;DR

StreamingTOM introduces a training-free, two-stage framework that significantly reduces memory and computational costs in streaming video understanding while maintaining high accuracy, enabling real-time processing without retraining.

Contribution

The paper proposes a plug-and-play approach that combines Causal Temporal Reduction with Online Quantized Memory to address the causality and accumulation constraints of streaming video, improving efficiency and bounding memory without retraining.

Findings

01. Achieves a 15.7x kv-cache compression ratio.

02. Lowers peak memory by 1.2x and delivers 2x faster time to first token (TTFT) than the prior SOTA.

03. Maintains accuracy, reaching 63.8% on offline benchmarks.

Abstract

Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens, ensuring predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them,…
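The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (a weighted sum of adjacent-frame change and saliency), the `alpha` weight, the per-group min/max quantizer, and all function names are assumptions made for illustration, and plain Python lists stand in for real visual-token tensors. Retrieval of relevant groups is not sketched.

```python
import math

def select_tokens(prev_frame, cur_frame, saliency, budget, alpha=0.5):
    """Return indices of at most `budget` tokens kept from cur_frame.

    Hypothetical scoring: alpha * (distance to the same token position in
    the previous frame) + (1 - alpha) * saliency. Only the previous frame
    is consulted, so the selection stays causal.
    """
    scores = []
    for i, tok in enumerate(cur_frame):
        # First frame of the stream has no predecessor: change is zero.
        change = 0.0 if prev_frame is None else math.dist(tok, prev_frame[i])
        scores.append(alpha * change + (1 - alpha) * saliency[i])
    ranked = sorted(range(len(cur_frame)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # fixed per-frame budget, original order

def quantize_4bit(values):
    """Uniformly quantize one group of floats to integer codes in 0..15."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant groups
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale

def dequantize_4bit(codes, lo, scale):
    """Reconstruct approximate floats from 4-bit codes."""
    return [lo + c * scale for c in codes]

# Toy stream: 4 tokens per frame, 2 features per token.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
cur = [[0.0, 0.0], [1.0, 1.0], [5.0, 6.0], [3.0, 3.1]]
kept = select_tokens(prev, cur, saliency=[0.9, 0.1, 0.2, 0.1], budget=2)
# kept == [0, 2]: token 2 changed most, token 0 is the most salient.

# Store each kept token as a 4-bit group and dequantize it on demand.
memory = [quantize_4bit(cur[i]) for i in kept]
restored = [dequantize_4bit(*group) for group in memory]
```

With a fixed `budget`, the number of tokens passed to prefill per frame is constant regardless of frame content, which is what makes latency predictable. The quantizer's reconstruction error per value is at most half of `scale`, so groups with a narrow value range are restored more precisely.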

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning