LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

Chaoyue Li; Yongxue Xu; Jie Feng; Jiayu Ding

arXiv:2605.19390 · cs.CV · May 20, 2026

TL;DR

This paper introduces Track4D-Bench, a benchmark, and LMM-Track4D, a model, for 4D dynamic reasoning in large multimodal models (LMMs), focusing on multi-turn spatiotemporal dialogue grounded in structured 3D trajectories.

Contribution

It proposes a new task (trajectory-grounded multi-turn spatiotemporal dialogue), a dataset (Track4D-Bench), and a model architecture (LMM-Track4D) to improve 4D reasoning in multimodal models.

Findings

1. LMM-Track4D outperforms strong baselines on Track4D-Bench.

2. Explicit dynamic state modeling enhances 4D reasoning.

3. The approach effectively handles occlusion and viewpoint variation.

Abstract

Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray-Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation…
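
To make the task description concrete, here is a minimal sketch of what one trajectory-grounded dialogue turn might look like as a data structure. Everything in it is an assumption for illustration: the names `TrajectoryPoint`, `DialogueTurn`, and `restrict_to_segment`, the per-frame `xyz` representation, and the inclusive frame-range convention are hypothetical and are not taken from Track4D-Bench or the LMM-Track4D repository.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TrajectoryPoint:
    frame: int                        # frame index within the clip
    xyz: Tuple[float, float, float]   # hypothetical 3D position of the target


@dataclass
class DialogueTurn:
    query: str                                 # spatiotemporal question
    segment: Optional[Tuple[int, int]] = None  # inclusive frame range; None means the whole clip
    answer: str = ""                           # free-text reply
    trajectory: List[TrajectoryPoint] = field(default_factory=list)


def restrict_to_segment(points, segment):
    """Keep only the points inside the queried segment (inclusive bounds)."""
    if segment is None:
        return list(points)
    start, end = segment
    return [p for p in points if start <= p.frame <= end]


# Toy 10-frame track moving along x at constant depth (made-up values).
track = [TrajectoryPoint(f, (0.1 * f, 0.0, 2.0)) for f in range(10)]

turn = DialogueTurn(
    query="Where does the car go between frames 3 and 6?",
    segment=(3, 6),
)
turn.trajectory = restrict_to_segment(track, turn.segment)
turn.answer = f"Tracked over {len(turn.trajectory)} frames."
```

The sketch only illustrates the pairing of a text answer with a structured trajectory over either a full clip or a specified segment; the actual annotation schema and output format may differ.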

Code & Models

Repositories

mikubaka88/LMM-Track4D (GitHub)
