TL;DR
Motion-o introduces an explicit, verifiable motion component to video reasoning models, enhancing their ability to reason about object trajectories and dynamic evidence in videos.
Contribution
The paper formalizes Spatial-Temporal-Trajectory (STT) reasoning and extends vision-language models with Motion Chain of Thought (MCoT), a structured motion pathway that improves trajectory-faithful reasoning.
Findings
Motion-o improves trajectory-based reasoning across multiple benchmarks.
It enhances interpretability by making object motion explicit and verifiable.
The approach does not require architectural changes to existing models.
Abstract
Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding where and when evidence appears, they often leave the motion connecting observations, the how, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete <motion/> tag summarizing direction, speed, and scale change. To supervise MCoT, we…
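To make the idea of a discrete motion tag concrete, here is a minimal Python sketch that derives direction, speed, and scale-change labels from two timestamped bounding boxes of the same object. Everything specific in it is an assumption for illustration: the `Observation` type, the `motion_tag` function, the attribute names, the label vocabularies, and the thresholds are hypothetical, and the excerpt above does not specify the tag format Motion-o actually uses.

```python
import math
from dataclasses import dataclass
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Observation:
    """One grounded observation of an object: a timestamp and a box."""
    t: float  # seconds
    box: Box


def _center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def _area(box: Box) -> float:
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def motion_tag(a: Observation, b: Observation,
               still_eps: float = 0.02,
               fast_thresh: float = 0.25,
               scale_eps: float = 0.10) -> str:
    """Summarize motion from a to b as a discrete tag.

    The thresholds and label names are illustrative assumptions,
    not values taken from the paper.
    """
    dt = b.t - a.t
    if dt <= 0:
        raise ValueError("observations must be in increasing time order")
    if _area(a.box) <= 0 or _area(b.box) <= 0:
        raise ValueError("boxes must have positive area")

    (ax, ay), (bx, by) = _center(a.box), _center(b.box)
    dx, dy = bx - ax, by - ay
    dist = math.hypot(dx, dy)

    if dist < still_eps:
        direction, speed = "none", "static"
    else:
        # Image coordinates: y grows downward.
        if abs(dx) >= abs(dy):
            direction = "right" if dx > 0 else "left"
        else:
            direction = "down" if dy > 0 else "up"
        speed = "fast" if dist / dt >= fast_thresh else "slow"

    ratio = _area(b.box) / _area(a.box)
    if ratio > 1 + scale_eps:
        scale = "growing"
    elif ratio < 1 - scale_eps:
        scale = "shrinking"
    else:
        scale = "stable"

    return f'<motion direction="{direction}" speed="{speed}" scale="{scale}"/>'


if __name__ == "__main__":
    start = Observation(t=0.0, box=(0.1, 0.4, 0.2, 0.6))
    end = Observation(t=1.0, box=(0.6, 0.4, 0.7, 0.6))
    print(motion_tag(start, end))
```

Because the tag is computed from the same boxes and timestamps that ground the where and when, a claim such as "the object moves right quickly" could in principle be checked against the evidence chain rather than taken on trust, which is the kind of verifiability the abstract argues for.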
