An Efficient Streaming Video Understanding Framework with Agentic Control

Jinming Liu, Jianguo Huang, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Zongyu Guo, Bin Li, Wenjun Zeng, Yan Lu, Xin Jin

arXiv:2605.17921 · cs.CV · May 19, 2026

TL;DR

The paper introduces R3-Streaming, a dynamic control framework for streaming video understanding that optimizes memory, response readiness, and computation routing to improve efficiency and accuracy under latency constraints.

Contribution

It presents a novel cascaded control approach with age-aware memory compression and a reinforcement learning-based compute routing method for streaming video understanding.

Findings

01  Achieves state-of-the-art results on OVO-Bench and StreamingBench datasets.

02  Reduces visual token usage by 95 to 96 percent.

03  Effectively balances model complexity and latency in streaming video tasks.

Abstract

Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial…
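
The cascade described in the abstract (compress memory, judge response readiness, then route computation) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name and number here is hypothetical: `Frame`, `compress_memory`, `is_ready`, `route`, `handle_query`, the half-life decay rule, and all thresholds. Plain token truncation stands in for memory compression, and a fixed threshold stands in for the reinforcement learning-based router, which the abstract does not specify in detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Frame:
    timestamp: float   # seconds since stream start
    tokens: List[int]  # stand-in for a frame's visual tokens


def compress_memory(frames: List[Frame], now: float, full_budget: int = 64,
                    half_life: float = 10.0, min_tokens: int = 2) -> List[Frame]:
    """Hypothetical age-aware forgetting rule: a frame's token budget halves
    every `half_life` seconds of age, but never drops below `min_tokens`."""
    compressed = []
    for frame in frames:
        age = max(0.0, now - frame.timestamp)
        budget = max(min_tokens, int(full_budget * 0.5 ** (age / half_life)))
        # Truncation is a crude placeholder for a real compression step.
        compressed.append(Frame(frame.timestamp, frame.tokens[:budget]))
    return compressed


def is_ready(readiness_score: float, threshold: float = 0.5) -> bool:
    """Assumed readiness check: answer only if the score clears a threshold."""
    return readiness_score >= threshold


def route(difficulty_score: float, threshold: float = 0.7) -> str:
    """Assumed threshold router (a stand-in for a learned routing policy)."""
    return "heavy" if difficulty_score >= threshold else "fast"


def handle_query(frames: List[Frame], now: float, readiness_score: float,
                 difficulty_score: float) -> Dict[str, Union[str, int]]:
    """Run the three stages in order; each stage sees the previous stage's output."""
    memory = compress_memory(frames, now)
    memory_tokens = sum(len(f.tokens) for f in memory)
    if not is_ready(readiness_score):
        return {"action": "wait", "memory_tokens": memory_tokens}
    return {"action": "answer", "model": route(difficulty_score),
            "memory_tokens": memory_tokens}
```

In this sketch older frames keep fewer tokens than recent ones, a query that is not yet answerable is deferred rather than sent to any model, and the heavier model is used only when the difficulty score is high. In a real system the readiness and difficulty scores would come from learned components rather than being passed in by the caller.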
