Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering
Xingrui Wang, Wufei Ma, Angtian Wang, Shuo Chen, Adam Kortylewski, Alan Yuille

TL;DR
This paper introduces a new dataset and a neural-symbolic model that explicitly incorporate physics priors to improve understanding of 4D dynamic scenes in video question answering, focusing on physical concepts like velocity, acceleration, and collisions.
Contribution
The work presents the DynSuperCLEVR dataset and the NS-4DPhysics model, which together enable better reasoning about 4D scene dynamics through explicit scene representations and physics-informed reasoning.
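As a rough illustration of what an explicit 4D scene representation could expose to a reasoning module, the sketch below estimates velocity and acceleration from a short trajectory of 3D positions using finite differences, and flags a collision with a simple sphere-overlap test. This is not the NS-4DPhysics implementation: the function names, the sampling interval, the example trajectory, and the sphere approximation are all assumptions made for illustration.

```python
from math import dist

def finite_difference(samples, dt):
    """Forward differences of a sequence of 3D vectors sampled every dt seconds."""
    return [
        tuple((b - a) / dt for a, b in zip(prev, curr))
        for prev, curr in zip(samples, samples[1:])
    ]

def spheres_collide(center_a, center_b, radius_a, radius_b):
    """Hypothetical collision test: each object is approximated by a sphere."""
    return dist(center_a, center_b) <= radius_a + radius_b

# Hypothetical trajectory: an object speeding up along x, sampled every 0.5 s.
dt = 0.5
positions = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

velocities = finite_difference(positions, dt)      # one fewer entry than positions
accelerations = finite_difference(velocities, dt)  # one fewer entry than velocities
```

With per-frame quantities like these available, a factual question such as "which object is moving faster?" reduces to comparing velocity magnitudes, rather than inferring motion from raw pixels across frames.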
Findings
NS-4DPhysics outperforms previous models on 4D dynamics questions
Explicit scene representations improve understanding of physical interactions
Large models struggle with 4D dynamic reasoning without physics priors
Abstract
For vision-language models (VLMs), understanding the dynamic properties of objects and their interactions in 3D scenes from videos is crucial for effective reasoning about high-level temporal and action semantics. Although humans are adept at understanding these properties by constructing 3D and temporal (4D) representations of the world, current video understanding models struggle to extract these dynamic semantics, arguably because these models use cross-frame reasoning without underlying knowledge of the 3D/4D scenes. In this work, we introduce DynSuperCLEVR, the first video question answering dataset that focuses on language understanding of the dynamic properties of 3D objects. We concentrate on three physical concepts within 4D scenes: velocity, acceleration, and collisions. We further generate three types of questions, including factual queries, future predictions, and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
