TL;DR
This paper introduces Track4D-Bench, a benchmark, and LMM-Track4D, a model, for 4D dynamic reasoning in large multimodal models, focusing on multi-turn spatiotemporal dialogue and 3D trajectory prediction.
Contribution
It proposes a new task (trajectory-grounded multi-turn spatiotemporal dialogue), a dataset (Track4D-Bench), and a model architecture (LMM-Track4D) to improve 4D reasoning in multimodal models.
Findings
LMM-Track4D outperforms strong baselines on Track4D-Bench.
Explicit dynamic state modeling enhances 4D reasoning.
The approach effectively handles occlusion and viewpoint variation.
Abstract
Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray–Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation…
