StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Xueyi Chen, Keda Tao, Kele Shao, Huan Wang

TL;DR
StreamingTOM is a training-free, two-stage framework that reduces both pre-LLM prefill cost and post-LLM kv-cache memory in streaming video understanding while maintaining accuracy, enabling real-time processing.
Contribution
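StreamingTOM is a plug-and-play approach that combines Causal Temporal Reduction before the LLM with Online Quantized Memory after it. Together, the two stages address the causality and accumulation constraints of streaming video, keeping latency predictable and memory bounded without retraining.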
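As a rough illustration of the Causal Temporal Reduction stage, the sketch below keeps a fixed number of tokens per frame, scoring each token by how much it changed from the previous frame plus a saliency value, as the abstract describes. The function names, the L2 distance, the `alpha` weighting, and the assumption that tokens are position-aligned across frames are all hypothetical choices made for this example; the paper's actual scoring rule is not given in this summary.

```python
from typing import List, Optional

Token = List[float]  # one visual token embedding (toy representation)


def l2_distance(a: Token, b: Token) -> float:
    """Euclidean distance between two token embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_tokens(
    frame: List[Token],
    prev_frame: Optional[List[Token]],
    saliency: List[float],
    budget: int,
    alpha: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> List[int]:
    """Return the indices of at most `budget` tokens kept for this frame."""
    scores = []
    for i, tok in enumerate(frame):
        # Causal: only the previous frame is consulted, never a future one.
        change = l2_distance(tok, prev_frame[i]) if prev_frame is not None else 0.0
        scores.append(alpha * change + (1.0 - alpha) * saliency[i])
    ranked = sorted(range(len(frame)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # restore spatial order of the survivors


# Toy usage: four tokens per frame, keep two.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curr = [[0.0, 0.0], [1.0, 1.0], [5.0, 6.0], [3.0, 3.1]]
sal = [0.1, 0.9, 0.2, 0.1]
kept = select_tokens(curr, prev, sal, budget=2)
print(kept)
```

Because the budget is fixed, the number of tokens passed to the LLM per frame stays constant regardless of frame content, which is what makes per-frame prefill cost predictable.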
Findings
Achieves a 15.7x kv-cache compression ratio.
Lowers peak memory by 1.2x and delivers 2x faster time to first token (TTFT) than prior SOTA.
Maintains high accuracy with 63.8% on offline benchmarks.
Abstract
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens, ensuring predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them,…
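The 4-bit storage and dequantization steps mentioned for Online Quantized Memory can be illustrated with a minimal uniform quantizer. This is only a sketch under stated assumptions: the min/max scaling, the two-codes-per-byte packing, and the names `quantize_4bit` and `dequantize_4bit` are hypothetical, and group retrieval is omitted. It makes no claim about the paper's actual quantization scheme or about how the reported compression ratio decomposes.

```python
from typing import List, Tuple


def quantize_4bit(values: List[float]) -> Tuple[bytes, float, float]:
    """Uniformly quantize floats to 4-bit codes, packing two codes per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 16 levels; fall back to 1.0 for constant input
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so codes pair up evenly
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_4bit(packed: bytes, scale: float, lo: float, n: int) -> List[float]:
    """Unpack 4-bit codes and map them back to approximate float values."""
    codes: List[int] = []
    for b in packed:
        codes.extend((b >> 4, b & 0x0F))
    return [lo + c * scale for c in codes[:n]]  # drop any padding code


# Toy usage: values chosen so the 16 levels reproduce them exactly.
values = [0.0, 1.0, 2.0, 15.0]
packed, scale, lo = quantize_4bit(values)
restored = dequantize_4bit(packed, scale, lo, len(values))
print(len(packed), restored)
```

In general the round trip is lossy: each restored value can differ from the original by up to about half of `scale`, which is the usual trade-off for storing each value in 4 bits.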
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
