Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering

Xingrui Wang; Wufei Ma; Angtian Wang; Shuo Chen; Adam Kortylewski; Alan Yuille

arXiv:2406.00622 · cs.CV · April 24, 2025

TL;DR

This paper introduces a new dataset and a neural-symbolic model that explicitly incorporate physics priors to improve understanding of 4D dynamic scenes in video question answering, focusing on physical concepts like velocity, acceleration, and collisions.

Contribution

The work presents the DynSuperCLEVR dataset and the NS-4DPhysics model, which enable better reasoning about 4D scene dynamics through explicit scene representations and physics-informed reasoning.

Findings

1. NS-4DPhysics outperforms previous models on 4D dynamics questions.

2. Explicit scene representations improve understanding of physical interactions.

3. Large models struggle with 4D dynamic reasoning without physics priors.

Abstract

For vision-language models (VLMs), understanding the dynamic properties of objects and their interactions in 3D scenes from videos is crucial for effective reasoning about high-level temporal and action semantics. Although humans are adept at understanding these properties by constructing 3D and temporal (4D) representations of the world, current video understanding models struggle to extract these dynamic semantics, arguably because these models use cross-frame reasoning without underlying knowledge of the 3D/4D scenes. In this work, we introduce DynSuperCLEVR, the first video question answering dataset that focuses on language understanding of the dynamic properties of 3D objects. We concentrate on three physical concepts within 4D scenes: velocity, acceleration, and collisions. We further generate three types of questions, including factual queries, future predictions, and…
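To make the three physical concepts named in the abstract concrete, here is a minimal sketch of how velocity, acceleration, and a collision could be read off an explicit 4D (3D position over time) trajectory. It is not the paper's NS-4DPhysics method: the function names, the bounding-sphere contact test, and the trajectories are all hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def finite_difference(samples: List[Vec3], dt: float) -> List[Vec3]:
    """Forward difference between consecutive samples, divided by the time step."""
    return [
        tuple((b[i] - a[i]) / dt for i in range(3))
        for a, b in zip(samples, samples[1:])
    ]


def first_contact_frame(
    traj_a: List[Vec3], traj_b: List[Vec3], radius_a: float, radius_b: float
) -> Optional[int]:
    """Index of the first frame where two bounding spheres overlap, else None.

    Assumption: each object is approximated by a sphere, which is a crude
    stand-in for real collision geometry.
    """
    for frame, (a, b) in enumerate(zip(traj_a, traj_b)):
        dist_sq = sum((a[i] - b[i]) ** 2 for i in range(3))
        if dist_sq <= (radius_a + radius_b) ** 2:
            return frame
    return None


# Hypothetical per-frame 3D positions, sampled every 0.5 s.
dt = 0.5
car = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
bus = [(8.0, 0.0, 0.0)] * 4  # stationary object

velocity = finite_difference(car, dt)            # first derivative of position
acceleration = finite_difference(velocity, dt)   # second derivative of position
collision_frame = first_contact_frame(car, bus, radius_a=1.0, radius_b=1.0)
```

With an explicit per-frame 3D state like this, a factual question about whether an object is speeding up reduces to inspecting `acceleration`, and a collision question reduces to checking `collision_frame`.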

Code & Models

Repositories

XingruiWang/SuperCLEVR-Physics
Official

Videos

Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition