Emergent Temporal Correspondences from Video Diffusion Transformers

Jisu Nam; Soowon Son; Dahyun Chung; Jiyoung Kim; Siyoon Jin; Junhwa Hur; Seungryong Kim

arXiv:2506.17220·cs.CV·June 24, 2025

TL;DR

This paper introduces DiffTrack, a framework for analyzing how Diffusion Transformers establish temporal correspondences in videos, revealing key mechanisms and enabling improved zero-shot tracking and generation.

Contribution

DiffTrack provides the first systematic quantitative analysis of temporal correspondence formation in video diffusion transformers, with novel metrics and practical applications.

Findings

01

Query-key similarities in specific layers, but not all, are crucial for temporal matching.

02

Temporal correspondence becomes stronger during denoising.

03

DiffTrack achieves state-of-the-art zero-shot point tracking.

Abstract

Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated video with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e.g., representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching, and that this matching becomes increasingly…
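To make the idea of query-key temporal matching concrete, here is a minimal sketch, not DiffTrack's implementation. It assumes each frame has already been reduced to one feature vector per spatial token, and it matches every query token of one frame to its highest-scoring key token in another frame. The function name `match_tokens`, the `grid_width` parameter, and the toy one-hot features are all hypothetical, chosen only for illustration. A real video DiT would use multi-head attention over the tokens of all frames jointly, at a chosen layer and denoising timestep.

```python
import math


def match_tokens(queries, keys, grid_width):
    """Match each query token of one frame to a key token of another frame.

    queries, keys: lists of equal-length feature vectors, one per spatial
    token, in raster order. Returns one (row, col) position per query,
    giving the matched token's location in the key frame's grid.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    matches = []
    for q in queries:
        # Scaled dot-product score of this query against every key token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        # Softmax is monotonic, so the argmax of the raw scores is also
        # the argmax of the attention weights.
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append(divmod(best, grid_width))
    return matches


# Toy 2x2 token grid with one-hot features (illustrative data only).
frame_a = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Second frame: the same tokens cyclically shifted by one in raster order.
frame_b = frame_a[-1:] + frame_a[:-1]

matches = match_tokens(frame_a, frame_b, grid_width=2)
print(matches)
```

In this toy case each query recovers the shifted position of its own token, which is the behavior a point tracker built on attention scores would rely on.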

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Face Recognition and Analysis