TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

Jisu Nam; Jahyeok Koo; Soowon Son; Jaewoo Jung; Honggyu An; Junhwa Hur; Seungryong Kim

arXiv:2605.12587 · cs.CV · May 14, 2026

TL;DR

TrackCraft3R leverages pre-trained video diffusion transformers to perform dense 3D tracking from monocular videos, introducing a novel reference-anchored tracking approach that outperforms prior methods in accuracy and efficiency.

Contribution

This work is the first to repurpose video diffusion transformers as a fast, reference-anchored dense 3D tracker using a dual-latent and temporal RoPE alignment design.

Findings

1. Achieves state-of-the-art results on 3D tracking benchmarks.
2. Runs 1.3x faster and uses 4.6x less memory than previous methods.
3. Demonstrates robustness to large motions and long videos.

Abstract

Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time.…
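The distinction the abstract draws between frame-anchored and reference-anchored outputs can be made concrete with a toy data layout. The sketch below is not the paper's method: the function name `reference_anchored_tracks`, the dictionary-based point maps, and the hand-written correspondences are all hypothetical, and occlusion is ignored. It only shows that a frame-anchored output answers "what 3D point is visible at this pixel in frame t?", while a reference-anchored output answers "where is the point seen at this reference pixel at time t?".

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]
Point3D = Tuple[float, float, float]


def reference_anchored_tracks(
    point_maps: List[Dict[Pixel, Point3D]],
    correspondences: List[Dict[Pixel, Pixel]],
) -> Dict[Pixel, List[Point3D]]:
    """Re-index frame-anchored geometry by reference-frame pixels.

    point_maps[t][p]      -> 3D point visible at pixel p in frame t
                             (frame-anchored: indexed by frame t's own pixels).
    correspondences[t][r] -> pixel in frame t where the physical point seen at
                             reference pixel r (frame 0) appears.
                             Assumed given here; a real tracker must infer it.
    Returns tracks[r]     -> list over time of that point's 3D position
                             (reference-anchored: indexed by frame 0's pixels).
    """
    tracks: Dict[Pixel, List[Point3D]] = {}
    for ref_pixel in point_maps[0]:
        tracks[ref_pixel] = [
            point_maps[t][correspondences[t][ref_pixel]]
            for t in range(len(point_maps))
        ]
    return tracks


# Toy scene: two physical points that swap image locations between frames.
point_maps = [
    {(0, 0): (0.0, 0.0, 1.0), (0, 1): (1.0, 0.0, 1.0)},  # frame 0 (reference)
    {(0, 0): (0.0, 0.0, 2.0), (0, 1): (1.0, 0.0, 1.5)},  # frame 1
]
correspondences = [
    {(0, 0): (0, 0), (0, 1): (0, 1)},  # frame 0 maps to itself
    {(0, 0): (0, 1), (0, 1): (0, 0)},  # the two points have swapped pixels
]

tracks = reference_anchored_tracks(point_maps, correspondences)
```

In this toy scene, `point_maps[1][(0, 0)]` describes a different physical point than `point_maps[0][(0, 0)]`, whereas every entry of `tracks[(0, 0)]` describes the same physical point over time. That difference in indexing is the mismatch the abstract refers to.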

Code & Models

Repositories

cvlab-kaist/TrackCraft3r (GitHub)
