An Efficient Streaming Video Understanding Framework with Agentic Control
Jinming Liu, Jianguo Huang, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Zongyu Guo, Bin Li, Wenjun Zeng, Yan Lu, Xin Jin

TL;DR
The paper introduces R3-Streaming, a dynamic control framework for streaming video understanding that optimizes memory, response readiness, and computation routing to improve efficiency and accuracy under latency constraints.
Contribution
It presents a novel cascaded control approach with age-aware memory compression and a reinforcement learning-based compute routing method for streaming video understanding.
Findings
Achieves state-of-the-art results on the OVO-Bench and StreamingBench benchmarks.
Reduces visual token usage by 95 to 96 percent.
Balances accuracy against latency by routing computation per query rather than relying on a single fast or heavy model.
Abstract
Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial…
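The cascade described in the abstract (compress memory, judge response readiness, then route computation) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Frame` type, the exponential half-life rule in `age_aware_budget`, the token-count readiness check, and the fixed difficulty threshold in `route`. In particular, the paper describes a reinforcement learning-based router; the threshold below is only a stand-in for it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    timestamp: float  # seconds since the stream started
    num_tokens: int   # visual tokens currently kept for this frame


def age_aware_budget(age_s: float, full_tokens: int,
                     half_life_s: float = 30.0, min_tokens: int = 1) -> int:
    """Hypothetical forgetting rule: halve the token budget every half_life_s."""
    budget = int(full_tokens * 0.5 ** (age_s / half_life_s))
    return max(min_tokens, budget)


def compress_memory(frames: List[Frame], now: float,
                    full_tokens: int = 64) -> List[Frame]:
    """Stage 1 (Remember): older frames keep fewer tokens than recent ones."""
    return [
        Frame(f.timestamp,
              min(f.num_tokens, age_aware_budget(now - f.timestamp, full_tokens)))
        for f in frames
    ]


def is_ready(memory: List[Frame], min_tokens_needed: int = 32) -> bool:
    """Stage 2 (Respond): placeholder readiness check on retained evidence."""
    return sum(f.num_tokens for f in memory) >= min_tokens_needed


def route(query_difficulty: float, threshold: float = 0.5) -> str:
    """Stage 3 (Reason): threshold stand-in for a learned routing policy."""
    return "heavy" if query_difficulty > threshold else "fast"


def handle_query(frames: List[Frame], now: float, query_difficulty: float) -> str:
    """Run the three stages in order; each stage sees the previous stage's output."""
    memory = compress_memory(frames, now)
    if not is_ready(memory):
        return "wait"
    return route(query_difficulty)


if __name__ == "__main__":
    stream = [Frame(0.0, 64), Frame(30.0, 64), Frame(60.0, 64)]
    print([f.num_tokens for f in compress_memory(stream, now=60.0)])
    print(handle_query(stream, now=60.0, query_difficulty=0.8))
```

The point of the ordering is that the readiness check and the router both operate on the already-compressed memory, so a cheap early stage can stop the pipeline before the more expensive model is ever invoked.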
