Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models

Nanxi Li; Xiang Wang; Yuanjie Chen; Haode Zhang; Hong Li; Yong-Lu Li

arXiv:2604.03302·cs.CV·April 7, 2026

TL;DR

This paper shows that current multimodal large language models (MLLMs) struggle with intuitive physics, introduces two benchmark tasks to evaluate it, Next Frame Selection (NFS) and Temporal Coherence Verification (TCV), and proposes Scene Dynamic Field (SDF) to significantly improve their intuitive physics reasoning.

Contribution

The paper introduces two fundamental benchmark tasks for intuitive physics, Next Frame Selection (NFS) and Temporal Coherence Verification (TCV), and a novel Scene Dynamic Field (SDF) approach that enhances MLLMs' understanding of physical dynamics.

Findings

01. MLLMs perform poorly on physics reasoning benchmarks.

02. Scene Dynamic Field (SDF) improves fluid task performance by up to 20.7%.

03. SDF generalizes well to unseen physical domains.

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics…

Code & Models

Repositories

andylinx/Scene-Dynamic-Field (GitHub)

Videos

Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models (SlidesLive)