Grounding Video Reasoning in Physical Signals
Alibay Osmanli, Zixu Cheng, Shaogang Gong

TL;DR
This paper introduces a grounded benchmark for physical video understanding that evaluates models across four video sources, six physics domains, three prompt families, and four input conditions. It tests whether a correct answer is backed by temporal and spatial grounding and whether it holds up under perturbed inputs.
Contribution
It extends the what--when--where evaluation structure of V-STaR into a multi-source, multi-domain benchmark with detailed diagnostics for physical reasoning in videos.
Findings
Physics remains the strongest reasoning regime across models.
VSTAR-like prompts are effective for non-physics semantic comparison.
Spatial grounding is the most challenging aspect across settings.
Abstract
Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what--when--where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate…
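The record-then-derive design described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `GroundedEvent` class, its field names, the `expand` helper, and the example values are all hypothetical. Only the prompt family and input condition names come from the abstract. The sketch also assumes that every clip is evaluated under every family and condition pairing, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple

# Names taken from the abstract.
PROMPT_FAMILIES = ("physics", "vstar_like", "neutral_rstr")
INPUT_CONDITIONS = ("original", "shuffled", "ablated", "frame_masked")


@dataclass(frozen=True)
class GroundedEvent:
    """Hypothetical shared record for one clip (field names are assumptions)."""
    clip_id: str
    source: str                               # e.g. "SSV2"
    what: str                                 # event label
    when: Tuple[float, float]                 # (start_s, end_s) temporal target
    where: Tuple[float, float, float, float]  # (x1, y1, x2, y2) spatial target


def expand(event: GroundedEvent) -> List[Dict]:
    """Derive one evaluation item per (prompt family, input condition) pair.

    The temporal and spatial targets are copied unchanged into every item,
    mirroring the statement that they are shared across prompt families.
    """
    return [
        {
            "clip_id": event.clip_id,
            "family": family,
            "condition": condition,
            "when": event.when,
            "where": event.where,
        }
        for family, condition in product(PROMPT_FAMILIES, INPUT_CONDITIONS)
    ]


# Illustrative values only; not drawn from the benchmark.
example = GroundedEvent("clip_0001", "SSV2", "pouring", (1.0, 3.5), (0.1, 0.2, 0.6, 0.8))
items = expand(example)
```

Under these assumptions each clip yields 3 x 4 = 12 items, and any two items for the same clip differ only in the prompt family and input condition, so accuracy differences between them can be attributed to the prompt wording or the input perturbation rather than to a change in the grounding targets.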
