TL;DR
This paper introduces a novel online video understanding framework with transparent reasoning and evidence-aligned response timing, addressing challenges of real-time analysis and decision transparency.
Contribution
It proposes EvidenceAlign, a framework with a transparent reasoning controller and a hierarchical memory system for evidence-aligned, online video understanding.
Findings
Achieved 71.6% accuracy on StreamingBench, surpassing the previous state of the art.
Demonstrated precise response timing matching evidence appearance in videos.
Improved accuracy on StreamingBench from 67.63% to 71.60% with Thinking-QwenVL.
Abstract
Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce \textbf{\model{}}, an instantiation of this framework with two core components. First, the \emph{Active Thinking Decision Maker (ATDM)} is a transparent reasoning controller that externalizes its decision process using observable progress () and…