MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence

Xingyilang Yin; Chengzhengxu Li; Jiahao Chang; Chi-Man Pun; Xiaodong Cun

arXiv:2603.00515 · cs.CV · March 3, 2026

TL;DR

MLLM-4D is a framework that strengthens 4D spatial-temporal reasoning in multimodal large language models (MLLMs) from purely visual inputs, combining a data curation pipeline built on existing stereo video datasets with a post-training strategy.

Contribution

The paper presents a cost-effective data curation pipeline and a post-training strategy that significantly improve 4D understanding and reasoning in MLLMs without architectural modifications.

Findings

1. Achieves state-of-the-art spatial-temporal reasoning from 2D RGB inputs.
2. Develops high-quality 4D spatiotemporal datasets from stereo videos.
3. Demonstrates effective 4D reasoning capabilities through extensive experiments.

Abstract

Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our…
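
As background for why stereo video is a convenient source of 4D supervision: a rectified stereo pair yields metric depth from disparity through the standard relation Z = f * B / d, and depth tracked across frames gives 3D positions over time. The sketch below illustrates only that textbook relation. It is not the paper's curation pipeline, whose details are not given in this summary, and the function names, parameters, and sample values are all hypothetical.

```python
from typing import List, Optional, Tuple


def disparity_to_depth(
    disparity_px: List[Optional[float]],
    focal_px: float,
    baseline_m: float,
    min_disparity: float = 1e-6,
) -> List[Optional[float]]:
    """Convert disparities (pixels) to metric depth (meters) via Z = f * B / d.

    Hypothetical helper for illustration. Missing or non-positive
    disparities map to None, because depth is undefined there.
    """
    depths: List[Optional[float]] = []
    for d in disparity_px:
        if d is None or d <= min_disparity:
            depths.append(None)
        else:
            depths.append(focal_px * baseline_m / d)
    return depths


def back_project(
    u: float, v: float, depth_m: float, focal_px: float, cx: float, cy: float
) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with known depth to camera-frame 3D coordinates.

    Assumes a pinhole camera with equal focal length on both axes and
    principal point (cx, cy).
    """
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)


if __name__ == "__main__":
    # Assumed sample values: 500 px focal length, 10 cm stereo baseline.
    depths = disparity_to_depth([10.0, 0.0, 50.0], focal_px=500.0, baseline_m=0.1)
    print(depths)
    print(back_project(420.0, 240.0, 5.0, focal_px=500.0, cx=320.0, cy=240.0))
```

Applying `back_project` to the same tracked point in consecutive frames would give a 3D trajectory, which is the kind of spatial-temporal signal that question-answer data could be built on under these assumptions.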

Code & Models

Models

Hugging Face: flow666/MLLM-4D

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning