Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision
Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim, Chengzhi Mao

TL;DR
This paper shows that pre-trained video diffusion models can track similar-looking objects in a self-supervised way by using their inherent motion representations, outperforming recent self-supervised methods without task-specific training.
Contribution
The authors reveal that video diffusion models naturally learn motion features useful for tracking, enabling a new self-supervised approach that outperforms recent methods on relevant benchmarks.
Findings
Up to 6-point improvement over recent self-supervised trackers.
Diffusion models isolate motion in early denoising stages.
Effective tracking of identical objects across challenging conditions.
Abstract
Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and…
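As a rough illustration of how per-frame features can drive self-supervised tracking, the sketch below noises a frame to a chosen noise level and propagates labels from a reference frame to a target frame by feature similarity. This is a generic label-propagation sketch, not the authors' pipeline: the function names, `alpha_bar`, `top_k`, and `temperature` are hypothetical, and the step that would extract features from a video diffusion model is not shown because the abstract does not specify it.

```python
import numpy as np

def add_noise(frame, alpha_bar, rng):
    """Standard forward-diffusion noising: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.

    A small alpha_bar corresponds to a high-noise (early denoising) stage.
    """
    eps = rng.standard_normal(frame.shape)
    return np.sqrt(alpha_bar) * frame + np.sqrt(1.0 - alpha_bar) * eps

def propagate_labels(feat_ref, feat_tgt, labels_ref, top_k=5, temperature=0.1):
    """Copy labels from reference locations to target locations.

    feat_ref:   (N_ref, C) features of the reference frame (assumed to come
                from some feature extractor, e.g. a denoiser's hidden layer)
    feat_tgt:   (N_tgt, C) features of the target frame
    labels_ref: (N_ref, K) one-hot or soft labels of the reference frame
    Returns (N_tgt, K) soft labels for the target frame.
    """
    # Cosine similarity between every target and reference location.
    a = feat_ref / (np.linalg.norm(feat_ref, axis=1, keepdims=True) + 1e-8)
    b = feat_tgt / (np.linalg.norm(feat_tgt, axis=1, keepdims=True) + 1e-8)
    sim = b @ a.T  # (N_tgt, N_ref)

    # Keep only the top_k most similar reference locations per target.
    idx = np.argpartition(-sim, top_k - 1, axis=1)[:, :top_k]
    top = np.take_along_axis(sim, idx, axis=1)

    # Softmax over the kept similarities, then average their labels.
    w = np.exp((top - top.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkc->nc", w, labels_ref[idx])
```

In this sketch, appearance-only features would assign near-identical similarity to identical-looking objects, which is the failure case the abstract describes; the paper's claim is that features taken at high-noise stages carry motion information that helps separate them.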
