Video-CoM: Interactive Video Reasoning via Chain of Manipulations
Hanoona Rasheed, Mohammed Zumri, Muhammad Maaz, Ming-Hsuan Yang, Fahad Shahbaz Khan, Salman Khan

TL;DR
Video-CoM introduces an interactive video reasoning paradigm where models actively manipulate video content through iterative actions, enabling deeper understanding and reasoning, and achieves state-of-the-art results with fewer training samples.
Contribution
The paper proposes a novel interactive reasoning framework with a chain of manipulations, supported by a new instruction dataset and reinforcement learning, advancing video understanding beyond passive analysis.
Findings
Achieves a 3.6% performance improvement over state-of-the-art models.
Uses fewer training samples than comparable models.
Reasoning-aware rewards enhance accuracy and interpretability.
Abstract
Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still "think about videos", i.e., once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine-grained spatio-temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to "think with videos". Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video-CoM-Instruct, an 18K instruction-tuning dataset curated for multi-step manipulation reasoning. Beyond supervised…
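The iterative "act, gather evidence, then answer" loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the action names (`narrow`, `inspect`), the `State` container, and the scripted policy standing in for the model are all hypothetical, and the "video" is just a list of text labels.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    frames: list                 # toy "video": one text label per frame
    window: tuple                # (start, end) frame indices currently in view
    evidence: list = field(default_factory=list)


def narrow(state, start, end):
    """Hypothetical temporal action: restrict attention to frames[start:end]."""
    state.window = (start, end)


def inspect(state, index):
    """Hypothetical frame action: record one in-window frame as evidence."""
    lo, hi = state.window
    if not lo <= index < hi:
        raise ValueError("frame outside current window")
    state.evidence.append(state.frames[index])


ACTIONS = {"narrow": narrow, "inspect": inspect}


def run_chain(frames, policy, max_steps=8):
    """Apply actions chosen by `policy` until it answers or steps run out."""
    state = State(frames=frames, window=(0, len(frames)))
    for _ in range(max_steps):
        name, args = policy(state)
        if name == "answer":
            return args[0], state.evidence
        ACTIONS[name](state, *args)
    return None, state.evidence


def scripted_policy(state):
    """Stand-in for the model (assumption): a fixed three-step script."""
    if state.window == (0, len(state.frames)):
        return "narrow", (2, 4)
    if not state.evidence:
        return "inspect", (3,)
    return "answer", (state.evidence[-1],)


frames = ["intro", "kitchen", "pour", "red cup", "outro"]
answer, evidence = run_chain(frames, scripted_policy)
```

In this sketch each step changes what the reasoner is looking at before the answer is produced, which is the contrast with the passive setting where the video is encoded once and never revisited.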
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
