Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation

Su Sun; Cheng Zhao; Himangi Mittal; Gaurav Mittal; Rohith Kukkala; Yingjie Victor Chen; Mei Chen

arXiv:2512.06158·cs.CV·December 9, 2025

Open Access

TL;DR

This paper introduces Track4DGen, a two-stage framework that injects explicit tracker-derived motion priors into the generation of temporally coherent 4D dynamic objects from sparse inputs, improving stability and fidelity.

Contribution

The paper presents a new two-stage method combining multi-view diffusion and point tracking to enhance 4D object generation with explicit motion priors, surpassing existing baselines.

Findings

01. Outperforms baselines on multi-view video and 4D generation benchmarks.

02. Produces temporally stable and text-editable 4D assets.

03. Curates a new dataset, Sketchfab28, for benchmarking 4D generation.

Abstract

Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance. We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that…
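To make the idea of feature-level point correspondence concrete, the following is a minimal illustrative sketch, not the paper's loss, which this excerpt does not specify. It assumes per-frame feature maps and 2D point tracks with visibility flags are already available as NumPy arrays, and it penalizes the change in a tracked point's feature between consecutive frames. The function names `sample_features` and `correspondence_loss`, the nearest-neighbour sampling, and the squared-error penalty are all assumptions made for illustration.

```python
import numpy as np


def sample_features(feat, points):
    """Look up features at point locations by nearest-neighbour sampling.

    feat:   (H, W, C) feature map for one frame.
    points: (N, 2) array of (x, y) pixel coordinates.
    Returns an (N, C) array. Coordinates are clipped to the map bounds.
    """
    h, w, _ = feat.shape
    xs = np.clip(np.rint(points[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(points[:, 1]).astype(int), 0, h - 1)
    return feat[ys, xs]


def correspondence_loss(feats, tracks, visible):
    """Hypothetical temporal correspondence penalty (an assumption, not the paper's).

    feats:   (T, H, W, C) feature maps, one per frame.
    tracks:  (T, N, 2) tracked (x, y) positions of N points per frame.
    visible: (T, N) boolean visibility flags from the tracker.
    Returns the mean squared feature difference of each tracked point
    between consecutive frames, over points visible in both frames.
    """
    terms = []
    for t in range(1, feats.shape[0]):
        both = visible[t - 1] & visible[t]
        if not both.any():
            continue
        prev = sample_features(feats[t - 1], tracks[t - 1][both])
        curr = sample_features(feats[t], tracks[t][both])
        terms.append(((curr - prev) ** 2).sum(axis=-1))
    if not terms:
        return 0.0  # no point is visible in two consecutive frames
    return float(np.concatenate(terms).mean())
```

If a feature moves together with its tracked point, the penalty is zero; if the feature under the point changes along the track, the penalty grows. A trainable version would likely use differentiable bilinear sampling inside the generator rather than nearest-neighbour lookup on NumPy arrays.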

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Face recognition and analysis