Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu

TL;DR
This paper introduces MotionEpic, a multimodal large language model for fine-grained video grounding, and a Video-of-Thought reasoning framework that breaks down complex video understanding tasks into manageable steps, significantly improving performance.
Contribution
It presents a novel video MLLM with pixel-level spatial-temporal grounding and a CoT-inspired reasoning framework for advanced video comprehension.
Findings
Boosts state-of-the-art performance on complex video QA benchmarks
Presented by the authors as the first successful implementation of CoT for human-level video reasoning
Demonstrates potential for broader application in video understanding
Abstract
Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems, and addressing them step-by-step from a low-level pixel perception to high-level cognitive interpretation.…
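The step-by-step decomposition described in the abstract can be illustrated with a minimal sketch in which each sub-problem's output is fed into the next prompt. This is not the authors' implementation: the step wording in `STEP_TEMPLATES`, the `run_chain` function, and the `model` callable are all hypothetical names chosen for illustration, and a real system would pass video features (and, per the abstract, STSG-based grounding) rather than plain text.

```python
from typing import Callable, List

# Hypothetical step prompts, ordered from low-level perception to
# higher-level interpretation. The wording is illustrative only and is
# not taken from the paper.
STEP_TEMPLATES: List[str] = [
    "Identify the targets relevant to the question: {question}",
    "Given the targets ({previous}), locate them across the video frames.",
    "Given the tracked targets ({previous}), describe their actions.",
    "Given the observations ({previous}), answer the question: {question}",
]


def run_chain(question: str, model: Callable[[str], str]) -> List[str]:
    """Run each step in order, feeding the previous output into the next prompt.

    `model` is an assumed stand-in for any text-in, text-out call to a
    video MLLM; it is not a real library API.
    """
    outputs: List[str] = []
    previous = ""
    for template in STEP_TEMPLATES:
        # str.format ignores keyword arguments a template does not use.
        prompt = template.format(question=question, previous=previous)
        previous = model(prompt)
        outputs.append(previous)
    return outputs


def echo_model(prompt: str) -> str:
    """Toy model that returns a placeholder so the sketch runs offline."""
    return f"<response to: {prompt[:40]}>"


if __name__ == "__main__":
    for step_output in run_chain("Why does the car stop?", echo_model):
        print(step_output)
```

Keeping every intermediate output, rather than only the final answer, makes each stage of the chain inspectable, which is the main practical appeal of a CoT-style decomposition.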
Taxonomy
Topics: Aesthetic Perception and Analysis · Embodied and Extended Cognition · Cinema and Media Studies
