Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

Andong Deng; Tongjia Chen; Shoubin Yu; Taojiannan Yang; Lincoln Spencer; Yapeng Tian; Ajmal Saeed Mian; Mohit Bansal; Chen Chen

arXiv:2411.09921 · cs.CV · April 7, 2025

Open Access

TL;DR

This paper introduces Motion-Grounded Video Reasoning, a task in which a model answers a question about a video by producing pixel-level visual answers (video segmentation masks), which requires implicit spatiotemporal reasoning about motion.

Contribution

The paper proposes the Motion-Grounded Video Reasoning task, builds the GROUNDMORE dataset covering four question types (Causal, Sequential, Counterfactual, and Descriptive), and develops MORA, a baseline model for motion reasoning at the pixel level.

Findings

01

MORA outperforms the best existing visual grounding baseline on GROUNDMORE by an average of 21.5% (relative).

02

GROUNDMORE contains 249K object masks across 1,715 video clips, covering four question types (Causal, Sequential, Counterfactual, and Descriptive).

03

The task advances the capabilities of models in implicit motion reasoning and spatiotemporal grounding.

Abstract

In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work, which focuses on explicit action/motion grounding, to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video clips and 249K object masks and is deliberately designed with 4 question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion reasoning abilities. GROUNDMORE uniquely requires models to generate visual answers, providing a more concrete and visually interpretable response than plain…
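
To make the task format concrete, the following is a minimal Python sketch of what one question/answer pair might look like: a question, one of the four question types, and one binary mask per frame as the visual answer. The record layout (`MotionQuery`), the helper names, and the use of per-frame mask IoU as a score are all assumptions made for illustration; the abstract does not specify the released GROUNDMORE schema or the evaluation protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

Mask = List[List[int]]  # one binary mask per frame: rows of 0/1 values


class QuestionType(Enum):
    # The four question types named in the abstract.
    CAUSAL = "Causal"
    SEQUENTIAL = "Sequential"
    COUNTERFACTUAL = "Counterfactual"
    DESCRIPTIVE = "Descriptive"


@dataclass
class MotionQuery:
    # Hypothetical record layout, not the released GROUNDMORE schema.
    question: str
    question_type: QuestionType
    target_masks: List[Mask]  # one mask per video frame


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly (object absent in this frame).
    return 1.0 if union == 0 else inter / union


def mean_video_iou(pred_masks: List[Mask], query: MotionQuery) -> float:
    """Average per-frame IoU over a clip (an assumed scoring choice)."""
    if not pred_masks or len(pred_masks) != len(query.target_masks):
        raise ValueError("prediction and target must cover the same frames")
    scores = [mask_iou(p, t) for p, t in zip(pred_masks, query.target_masks)]
    return sum(scores) / len(scores)
```

A model for this task would take the video and `question` as input and return one mask per frame; `mean_video_iou` then compares that output with `target_masks`.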

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Video Analysis and Summarization