Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers
Zhengbo Zhang, Li Xu, Duo Peng, Hossein Rahmani, Jun Liu

TL;DR
Diff-Tracker uses a pre-trained text-to-image diffusion model, a learned target prompt, and online prompt updates to perform unsupervised visual tracking, achieving state-of-the-art results on five benchmark datasets.
Contribution
Introduces an unsupervised tracking method that pairs a pre-trained diffusion model with an initial prompt learner and an online prompt updater.
Findings
Achieves state-of-the-art performance on five benchmark datasets.
Effectively recognizes and tracks targets without supervision.
Demonstrates robustness across diverse tracking scenarios.
Abstract
We introduce Diff-Tracker, a novel approach for the challenging unsupervised visual tracking task leveraging the pre-trained text-to-image diffusion model. Our main idea is to leverage the rich knowledge encapsulated within the pre-trained diffusion model, such as the understanding of image semantics and structural information, to address unsupervised visual tracking. To this end, we design an initial prompt learner to enable the diffusion model to recognize the tracking target by learning a prompt representing the target. Furthermore, to facilitate dynamic adaptation of the prompt to the target's movements, we propose an online prompt updater. Extensive experiments on five benchmark datasets demonstrate the effectiveness of our proposed method, which also achieves state-of-the-art performance.
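The abstract names two components, an initial prompt learner and an online prompt updater, but does not give their losses or architecture. The sketch below is only a toy illustration of the general idea, not the paper's implementation: a frozen feature map stands in for the diffusion model, a single vector stands in for the prompt, and the prompt is fitted so that its attention concentrates on the target region. Every name here (`attention_map`, `fit_prompt`, `update_prompt`, `make_frame`), the cross-entropy objective, the momentum blend, and the synthetic features are assumptions made for illustration.

```python
import numpy as np

def attention_map(features, prompt):
    """Softmax attention of one prompt vector over all spatial positions."""
    logits = features @ prompt / np.sqrt(features.shape[1])
    logits = logits - logits.max()  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def fit_prompt(features, mask, init, steps=200, lr=0.5):
    """Gradient descent on the prompt only; the features stay frozen.

    Minimizes cross-entropy between the attention map and the normalized
    target mask (an assumed objective, not taken from the paper).
    """
    prompt = np.asarray(init, dtype=float).copy()
    mask = np.asarray(mask, dtype=float)
    target = mask / mask.sum()
    scale = np.sqrt(features.shape[1])
    for _ in range(steps):
        attn = attention_map(features, prompt)
        # Gradient of the cross-entropy w.r.t. the prompt vector.
        grad = features.T @ (attn - target) / scale
        prompt -= lr * grad
    return prompt

def update_prompt(prompt, features, pseudo_mask, momentum=0.9, steps=20, lr=0.5):
    """Hypothetical online update: refine on the new frame, then blend."""
    refined = fit_prompt(features, pseudo_mask, prompt, steps, lr)
    return momentum * prompt + (1.0 - momentum) * refined

def make_frame(rng, target_idx, n_pos=64, dim=16):
    """Synthetic frozen features: noise everywhere, a marker on the target."""
    feats = 0.5 * rng.standard_normal((n_pos, dim))
    feats[target_idx, 0] += 4.0
    return feats

# First frame: the target location is given, so the prompt is learned from it.
rng = np.random.default_rng(0)
first_idx = np.array([9, 10, 17, 18])
frame0 = make_frame(rng, first_idx)
mask0 = np.zeros(64)
mask0[first_idx] = 1.0
prompt = fit_prompt(frame0, mask0, init=np.zeros(16))

# Next frame: no label; the tracker's own top-4 attention positions
# serve as a pseudo-mask for the update.
frame1 = make_frame(rng, np.array([10, 11, 18, 19]))
attn1 = attention_map(frame1, prompt)
pseudo = (attn1 >= np.sort(attn1)[-4]).astype(float)
prompt = update_prompt(prompt, frame1, pseudo)
```

The point of the sketch is the division of labor: the pre-trained model is never fine-tuned, only the prompt changes, and later frames adjust it using the tracker's own predictions rather than annotations.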
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Diffusion
