FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin, Boris Ivanovic, Song Han, Trevor Darrell, Jitendra Malik, Marco Pavone, Boyi Li

TL;DR
FoundationMotion introduces an automated pipeline for creating large-scale, fine-grained motion datasets from videos, enabling improved training of models for physical reasoning and motion understanding.
Contribution
It presents a fully automated data curation method that combines object tracking and LLMs to generate detailed motion annotations, facilitating scalable dataset creation.
Findings
Models fine-tuned on FoundationMotion data outperform baselines.
Fine-tuned models show improved motion reasoning across benchmarks.
The pipeline enables scalable, high-quality motion dataset generation.
Abstract
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including…
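As a rough illustration of the step the abstract describes (tracked objects turned into trajectories, which are then handed to an LLM), the sketch below converts per-frame bounding boxes into a structured JSON record and wraps it in a text prompt. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `build_trajectories` and `build_prompt`, the JSON field names, the summary statistics, and the prompt wording are all hypothetical, and the detections are toy values rather than the output of a real detector or tracker.

```python
import json
from math import hypot


def box_center(box):
    """Return the center (cx, cy) of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def build_trajectories(detections):
    """Group tracked detections into per-object trajectories.

    `detections` is a list of (frame_idx, track_id, label, box) tuples,
    a hypothetical stand-in for whatever a detector plus tracker emits.
    """
    tracks = {}
    # Sort by track id, then frame, so each trajectory is in time order.
    for frame_idx, track_id, label, box in sorted(
        detections, key=lambda d: (d[1], d[0])
    ):
        cx, cy = box_center(box)
        track = tracks.setdefault(
            track_id, {"id": track_id, "label": label, "points": []}
        )
        track["points"].append({"frame": frame_idx, "cx": cx, "cy": cy})

    for track in tracks.values():
        pts = track["points"]
        # Total distance travelled by the box center, in pixels.
        track["path_length"] = sum(
            hypot(b["cx"] - a["cx"], b["cy"] - a["cy"])
            for a, b in zip(pts, pts[1:])
        )
        # Net displacement from the first to the last observed frame.
        track["net_displacement"] = [
            pts[-1]["cx"] - pts[0]["cx"],
            pts[-1]["cy"] - pts[0]["cy"],
        ]
    return list(tracks.values())


def build_prompt(trajectories):
    """Embed the trajectory JSON in a text prompt (hypothetical wording)."""
    return (
        "Object trajectories (pixel coordinates, origin at top-left):\n"
        + json.dumps(trajectories, indent=2)
        + "\nDescribe how each object moves, then write question-answer "
        "pairs about the motion."
    )


# Toy input: a cup that moves diagonally and a hand that stays still.
detections = [
    (0, 1, "cup", (10, 10, 30, 30)),
    (1, 1, "cup", (13, 14, 33, 34)),
    (2, 1, "cup", (16, 18, 36, 38)),
    (0, 2, "hand", (100, 50, 120, 70)),
    (2, 2, "hand", (100, 50, 120, 70)),
]
trajectories = build_trajectories(detections)
prompt = build_prompt(trajectories)
```

In a pipeline of this kind, the resulting `prompt` string would be sent to an LLM alongside sampled video frames. The point of the structured record is that quantities such as direction and distance are stated explicitly rather than left for the model to infer from pixels.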
Peer Reviews
Decision: Submitted to ICLR 2026
1. The automated pipeline will contribute enough high-quality data for the community to train models.
2. The experimental results show that this training data gives different models a clear performance boost, to the point that they even surpass the proprietary model Gemini-2.5-Flash.
3. A "how" motion benchmark is proposed. To some extent, it fills the gap left by the lack of such motion questions (compared with "what" questions).
4. In the Analysis section, ablation studies on the data generat
Major:
1. In line 216, in the function s_m = ..., what does the last term δ · max(∆r) represent?
2. In line 283, the authors mention the 7 dimensions of the motion caption. Are there any examples or explanations? Without them, some dimensions are hard to understand, e.g., (7) the evolving spatial relationships.
3. What is the value of the proposed "how" benchmark? The authors only showcase the results in Table 1, but there is no in-depth analysis of how existing models have good performance on "what" questions, as mentioned in
- Introduces a novel method that uses object trackers to create structured trajectory data (JSONs), which is then fed to an LLM to auto-generate a large-scale motion dataset.
- The paper provides a key ablation study showing that giving the model this structured tracking data results in much higher-quality questions and answers than giving it only the raw video frames.
- The method filters out videos with significant camera motion. How well, then, does the model generalize to real-world scenarios where the camera is moving?
- The paper lacks an analysis of error propagation. Since the pipeline is detect > track > caption, errors from upstream models (e.g., tracking failures) can appear as noise or incorrect ground truth in the final dataset, yet the impact of this noise on model training is not measured.
- The method's reliance on 2D bounding boxes
Strong Engineering Contribution: The data curation pipeline is meticulously designed, combining open-vocabulary detection, hierarchical human-centric tracking, and structured LLM-based caption/QA generation.
High-Quality Evaluation: The authors not only evaluate on public motion reasoning benchmarks (MotionBench, VLM4D) but also construct four new domain-specific “how motion” benchmarks (Daily, Robotics, AV-Car, AV-Hand), enhancing reproducibility and completeness.
Clear Performance Gains
Limited novelty: The work is technically solid, but the major contribution is building a large, well-engineered pipeline rather than proposing new algorithms or models. It feels more like a strong system integration effort than a conceptual or methodological innovation. Moreover, automatic/semi-automatic data curation/labelling is a known technique.
Error propagation: Since the data is mainly curated automatically using existing models, errors from these models will propagate into the dataset.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Human Pose and Action Recognition
