Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Yuzhi Huang; Kairun Wen; Rongxin Gao; Dongxuan Liu; Yibin Lou; Jie Wu; Jing Xu; Jian Zhang; Zheng Yang; Yunlong Lin; Chenxin Li; Panwang Pan; Junbin Lu; Jingyan Jiang; Xinghao Ding; Yue Huang; Zhi Wang

arXiv:2603.12746 · cs.CV · March 16, 2026

Open Access

TL;DR

This paper introduces Dyn-Bench, a comprehensive benchmark for evaluating multimodal large language models' ability to perceive, track, and reason about spatio-temporal dynamics in the physical 4D world, revealing current limitations and proposing structured integration methods for improvement.

Contribution

The paper presents Dyn-Bench, a large-scale benchmark for spatio-temporal reasoning, and proposes structured integration techniques that significantly improve MLLMs' dynamic perception and reasoning capabilities.

Findings

01. Existing models struggle with consistent spatio-temporal reasoning and dynamic object grounding.

02. Structured integration methods outperform conventional prompting strategies.

03. Dyn-Bench provides a scalable platform for evaluating physical 4D world understanding.

Abstract

Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics", i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Generative Adversarial Networks and Image Synthesis