MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Xiyang Wu; Zongxia Li; Jihui Jin; Guangyao Shi; Gouthaman KV; Vishnu Raj; Nilotpal Sinha; Jingxi Chen; Fan Du; Dinesh Manocha

arXiv:2511.18373·cs.CV·April 14, 2026

TL;DR

This paper introduces MASS, a model-agnostic approach that improves how vision-language models understand motion and spatial interactions in video, supported by a new benchmark (MASS-Bench) and reinforcement fine-tuning.

Contribution

The paper presents a novel approach for incorporating physical motion cues into VLMs, along with a comprehensive benchmark for physics reasoning in videos.

Findings

01. Refined VLMs outperform baselines and prior state-of-the-art models.

02. They perform close to closed-source models, with only a 2% gap.

03. They demonstrate improved physics reasoning and comprehension in videos.

Abstract

Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-related reasoning involving motion dynamics and spatial interactions. We present a novel approach to address this gap by translating physical-world context cues into interpretable representations aligned with VLM perception, comprehension, and reasoning. We introduce MASS, a model-agnostic approach that injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. We also contribute a comprehensive benchmark, MASS-Bench, consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections and grounding over sub-segments, as well as full-sequence 3D motion tracking of…
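To make the idea of injecting spatiotemporal signals into the VLM language space more concrete, here is a minimal sketch of one way a tracked object's depth-augmented trajectory could be turned into text. It is an illustration only, not the authors' implementation: the `TrackPoint` type, the `back_project` and `describe_motion` helpers, the pinhole camera intrinsics (`fx`, `fy`, `cx`, `cy`), the metric depth values, and the output text format are all assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackPoint:
    """One tracked observation of an object (hypothetical structure)."""
    t: float      # timestamp in seconds
    x: float      # pixel column of the object centre
    y: float      # pixel row of the object centre
    depth: float  # estimated depth in metres (assumed to be metric)


def back_project(p: TrackPoint, fx: float, fy: float,
                 cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift a pixel plus depth to camera-frame 3D using a pinhole model."""
    X = (p.x - cx) * p.depth / fx
    Y = (p.y - cy) * p.depth / fy
    return (X, Y, p.depth)


def describe_motion(label: str, track: List[TrackPoint],
                    fx: float, fy: float, cx: float, cy: float) -> str:
    """Serialise a 3D trajectory into text lines a VLM prompt could include."""
    pts = [back_project(p, fx, fy, cx, cy) for p in track]
    lines = []
    for (p0, a), (p1, b) in zip(zip(track, pts), zip(track[1:], pts[1:])):
        dt = p1.t - p0.t
        if dt <= 0:
            continue  # skip out-of-order or duplicate timestamps
        speed = math.dist(a, b) / dt
        lines.append(
            f"{label}: t={p0.t:.1f}-{p1.t:.1f}s, "
            f"pos=({b[0]:.2f}, {b[1]:.2f}, {b[2]:.2f}) m, "
            f"speed={speed:.2f} m/s"
        )
    return "\n".join(lines)


# Usage: an object at the image centre moving away from the camera.
track = [TrackPoint(0.0, 320, 240, 2.0), TrackPoint(1.0, 320, 240, 5.0)]
summary = describe_motion("ball", track, fx=500, fy=500, cx=320, cy=240)
print(summary)
```

A string such as `summary` could then be added to the question prompt, so that the language model reads explicit position and speed cues instead of having to infer them from pixels alone.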
