Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos
Youngseo Kim, Dohyun Kim, Geonhee Han, and Paul Hongsuck Seo

TL;DR
This paper reveals that image diffusion models can be used for zero-shot object tracking in videos by interpreting their self-attention maps as semantic propagation kernels, enabling robust segmentation without training on video data.
Contribution
The work introduces a novel interpretation of diffusion models' self-attention as semantic propagation kernels and develops DRIFT, a zero-shot video object tracking framework leveraging these insights.
Findings
Achieves state-of-the-art zero-shot video segmentation performance.
Demonstrates effective test-time optimization strategies for label propagation.
Extends image diffusion models for temporal propagation in videos.
Abstract
Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask…
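The idea of treating attention as a label propagation kernel can be illustrated with a minimal NumPy sketch. This is not the DRIFT implementation: the features below are random stand-ins rather than diffusion self-attention queries and keys, and the function names, the `temperature` parameter, and the toy "motion" (a permutation of pixels) are all assumptions made for illustration. The sketch only shows the general mechanism: a row-stochastic attention map from target-frame pixels to reference-frame pixels, applied to the reference labels.

```python
import numpy as np


def propagation_kernel(queries, keys, temperature=1.0):
    """Softmax attention from target-frame queries (N, d) to reference-frame keys (M, d).

    Returns an (N, M) matrix whose rows sum to 1. `temperature` is a
    hypothetical knob for sharpening the kernel, not a parameter from the paper.
    """
    dim = queries.shape[-1]
    logits = queries @ keys.T / (np.sqrt(dim) * temperature)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def propagate_labels(kernel, ref_labels):
    """Each target pixel receives an attention-weighted average of reference labels."""
    return kernel @ ref_labels


rng = np.random.default_rng(0)
num_pixels, dim, num_classes = 16, 8, 2

# Stand-in features for the reference frame (assumption: unit-normalized).
ref_feats = rng.normal(size=(num_pixels, dim))
ref_feats /= np.linalg.norm(ref_feats, axis=1, keepdims=True)

# One-hot reference mask: each pixel is background (0) or object (1).
ref_labels = np.eye(num_classes)[rng.integers(0, num_classes, size=num_pixels)]

# Toy target frame: the same pixels in a shuffled order.
perm = rng.permutation(num_pixels)
tgt_feats = ref_feats[perm]

kernel = propagation_kernel(tgt_feats, ref_feats, temperature=0.001)
tgt_labels = propagate_labels(kernel, ref_labels)
pred = tgt_labels.argmax(axis=1)  # predicted class per target pixel
```

In this toy setting each target pixel has an exact match in the reference frame, so a sharp kernel carries the labels along with the shuffled pixels. Real frames have no exact matches, which is why the quality of the features that define the kernel matters.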
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection · Medical Image Segmentation Techniques
