Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence
Tianyu Yuan, Yuanbo Yang, Lin-Zhuo Chen, Yao Yao, Zhuzhong Qian

TL;DR
This paper introduces HeFT (Head-Frequency Tracker), a zero-shot video point tracking method built on the internal representations of pretrained video diffusion models. It selects low-frequency feature components and specialized attention heads to improve correspondence accuracy without tracking-specific training.
Contribution
The paper analyzes the internal representations of the Video Diffusion Transformer (VDiT) and develops a head- and frequency-aware feature selection strategy that improves zero-shot tracking performance.
Findings
HeFT achieves state-of-the-art zero-shot tracking on TAP-Vid benchmarks.
Low-frequency components are crucial for correspondence, whereas high-frequency components tend to introduce noise.
Attention heads have specialized roles in matching, semantics, and position encoding.
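To make the frequency finding concrete, here is a minimal sketch of low-pass filtering a feature map and then matching a query point by cosine similarity. It is not the authors' implementation: the function names `low_pass` and `match_point`, the `keep_ratio` parameter, the box-shaped mask, and the choice to filter along the spatial axes are all assumptions made for illustration, and head selection is not shown.

```python
import numpy as np


def low_pass(features, keep_ratio=0.25):
    """Keep only the lowest spatial frequencies of an (H, W, C) feature map.

    Assumption: filtering is done per channel over the two spatial axes
    with a centered box mask; the paper may define frequency differently.
    """
    h, w, _ = features.shape
    # 2D FFT over the spatial axes, with zero frequency shifted to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(features, axes=(0, 1)), axes=(0, 1))
    # Distance of each frequency bin from the centered zero-frequency bin.
    fy = np.abs(np.arange(h) - h // 2)
    fx = np.abs(np.arange(w) - w // 2)
    mask = (fy[:, None] <= keep_ratio * h / 2) & (fx[None, :] <= keep_ratio * w / 2)
    spectrum = spectrum * mask[:, :, None]
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum, axes=(0, 1)), axes=(0, 1))
    return filtered.real


def match_point(query_feats, target_feats, query_yx):
    """Return the (y, x) location in target_feats most similar to the query point."""
    y, x = query_yx
    q = query_feats[y, x]
    q = q / (np.linalg.norm(q) + 1e-8)
    norms = np.linalg.norm(target_feats, axis=-1, keepdims=True) + 1e-8
    sim = (target_feats / norms) @ q  # cosine similarity map, shape (H, W)
    return np.unravel_index(np.argmax(sim), sim.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_a = rng.standard_normal((16, 16, 32))        # stand-in for features of frame A
    frame_b = np.roll(frame_a, shift=(1, 2), axis=(0, 1))  # frame B: A shifted by (1, 2)
    print(match_point(low_pass(frame_a), low_pass(frame_b), (3, 3)))
```

In a real pipeline the random arrays would be replaced by features extracted from a chosen attention head of the diffusion model, and `keep_ratio` would be tuned rather than fixed.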
Abstract
In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face recognition and analysis · Domain Adaptation and Few-Shot Learning
