TL;DR
TrackCraft3R leverages pre-trained video diffusion transformers to perform dense 3D tracking from monocular videos, introducing a novel reference-anchored tracking approach that outperforms prior methods in accuracy and efficiency.
Contribution
This work is the first to repurpose video diffusion transformers as a fast, reference-anchored dense 3D tracker, using a dual-latent design and temporal RoPE (rotary position embedding) alignment.
Findings
Achieves state-of-the-art results on 3D tracking benchmarks.
Runs 1.3x faster and uses 4.6x less memory than previous methods.
Demonstrates robustness to large motions and long videos.
Abstract
Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time.…
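To make the frame-anchored versus reference-anchored distinction concrete, here is a minimal toy sketch of what a reference-anchored dense 3D track looks like as an array. It is not the paper's model or code: the names `ref_points`, `velocity`, and `tracks`, the tiny grid size, and the rigid constant-velocity motion are all assumptions chosen only for illustration.

```python
import numpy as np

# Toy sizes (assumed for illustration): 4 frames, a 2x3 pixel reference frame.
T, H, W = 4, 2, 3

# One 3D point per reference-frame pixel, placed on the plane z = 1.
us, vs = np.meshgrid(np.arange(W), np.arange(H))
ref_points = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(float)  # (H, W, 3)

# Hypothetical motion: every point translates by the same amount each frame.
velocity = np.array([0.5, 0.0, 0.1])

# Reference-anchored tracks: tracks[t, v, u] is the 3D position at time t of the
# SAME physical point that was seen at pixel (v, u) in the reference frame (t = 0).
# A frame-anchored point map would instead store, at [t, v, u], whatever surface
# is visible at pixel (v, u) in frame t, which is generally a different point.
steps = np.arange(T)[:, None, None, None]         # (T, 1, 1, 1)
tracks = ref_points[None] + steps * velocity      # (T, H, W, 3)

# Per-point 3D displacement between the first and last frame.
displacement = tracks[-1] - tracks[0]             # (H, W, 3)
```

Because the first two spatial indices always refer to reference-frame pixels, the motion of any point can be read directly along the time axis, for example `tracks[:, 0, 0]` for the point at the top-left reference pixel.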
