Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

Tianyu Yuan; Yuanbo Yang; Lin-Zhuo Chen; Yao Yao; Zhuzhong Qian

arXiv:2512.04619 · cs.CV · March 24, 2026

Open Access

TL;DR

This paper introduces HeFT (Head-Frequency Tracker), a zero-shot video point tracking method built on the internal representations of pretrained video diffusion models. It selects the most informative attention head and the low-frequency feature components to improve correspondence accuracy, without any tracking-specific training.

Contribution

The paper analyzes the internal representations of the Video Diffusion Transformer (VDiT) and, based on that analysis, develops a head- and frequency-aware feature selection strategy that improves zero-shot tracking performance.

Findings

01. HeFT achieves state-of-the-art zero-shot tracking results on the TAP-Vid benchmarks.

02. Low-frequency components of VDiT features are crucial for correspondence, whereas high-frequency components tend to introduce noise.

03. Attention heads play specialized roles: matching, semantic understanding, and positional encoding.

Abstract

In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method…
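
The abstract is truncated before the method details, so the following is only a minimal sketch of the general idea (low-pass filter a feature map, then match a query point by cosine similarity), not the authors' implementation. It assumes that "low-frequency" refers to spatial frequencies of a per-head feature map of shape (H, W, C). The names `low_pass`, `match_point`, and `keep_ratio` are hypothetical, and random arrays stand in for real VDiT features.

```python
import numpy as np

def low_pass(feat, keep_ratio=0.5):
    """Keep only low spatial frequencies of a feature map of shape (H, W, C).

    keep_ratio is a hypothetical knob: the fraction of the Nyquist
    frequency retained along each spatial axis.
    """
    H, W, _ = feat.shape
    spec = np.fft.fft2(feat, axes=(0, 1))
    fy = np.abs(np.fft.fftfreq(H))[:, None]   # cycles per sample, in [0, 0.5]
    fx = np.abs(np.fft.fftfreq(W))[None, :]
    mask = (fy <= 0.5 * keep_ratio) & (fx <= 0.5 * keep_ratio)
    filtered = np.fft.ifft2(spec * mask[:, :, None], axes=(0, 1))
    return filtered.real                      # mask is symmetric, so the result is real

def match_point(src_feat, dst_feat, query_yx):
    """Return the (y, x) location in dst_feat most similar to src_feat[query_yx]."""
    q = src_feat[query_yx]
    q = q / (np.linalg.norm(q) + 1e-8)
    d = dst_feat / (np.linalg.norm(dst_feat, axis=-1, keepdims=True) + 1e-8)
    sim = d @ q                               # cosine similarity map, shape (H, W)
    y, x = np.unravel_index(np.argmax(sim), sim.shape)
    return int(y), int(x)

# Stand-in data: per-head features of shape (num_heads, H, W, C) for two frames.
rng = np.random.default_rng(0)
frame_a = rng.normal(size=(4, 16, 16, 8))
frame_b = np.roll(frame_a, shift=(3, 2), axis=(1, 2))  # simulate a (3, 2) shift

head = 1  # assumption: a head already identified as useful for matching
tracked = match_point(low_pass(frame_a[head]), low_pass(frame_b[head]), (5, 5))
```

How the head is chosen and which frequency band is kept are exactly the parts the truncated abstract does not specify, so `head` and `keep_ratio` are left as free parameters here.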

Taxonomy

Topics: Video Surveillance and Tracking Methods · Face recognition and analysis · Domain Adaptation and Few-Shot Learning