TL;DR
This paper identifies limitations in current multimodal large language models' understanding of physics, introduces benchmarks for evaluation, and proposes Scene Dynamic Field to significantly improve their intuitive physics reasoning capabilities.
Contribution
It introduces two fundamental physics reasoning benchmarks and a novel Scene Dynamic Field approach that enhances MLLMs' understanding of physical dynamics.
Findings
Even state-of-the-art MLLMs perform poorly on the two benchmark tasks, NFS and TCV.
Scene Dynamic Field (SDF) improves fluid task performance by up to 20.7%.
SDF generalizes well to unseen physical domains.
Abstract
While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics…
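The abstract names Next Frame Selection (NFS) but does not spell out how it is scored. As a rough illustration only, the sketch below assumes NFS is posed as a multiple-choice question: for each clip the model returns the index of one candidate frame, and performance is the fraction of clips where that index matches the true next frame. The function names `selection_accuracy` and `chance_accuracy` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def selection_accuracy(predicted: Sequence[int], correct: Sequence[int]) -> float:
    """Fraction of clips where the chosen candidate index equals the true one.

    Assumption (not from the paper): each clip has exactly one correct
    candidate, identified by its integer index.
    """
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if len(correct) == 0:
        raise ValueError("need at least one example")
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)


def chance_accuracy(num_candidates: int) -> float:
    """Expected accuracy of uniform random guessing among num_candidates options."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    return 1.0 / num_candidates


if __name__ == "__main__":
    # Made-up indices for four clips, purely to show the call pattern.
    model_choices = [0, 2, 1, 3]
    true_indices = [0, 2, 2, 3]
    print(selection_accuracy(model_choices, true_indices))
    print(chance_accuracy(4))
```

Comparing a model's accuracy against the random-guess baseline is one simple way to read a claim like "performs poorly" on a selection task. The paper's actual protocol may differ.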
