Can Visual Foundation Models Achieve Long-term Point Tracking?
Görkay Aydemir, Weidi Xie, Fatma Güney

TL;DR
This paper evaluates the ability of large-scale visual foundation models to perform long-term point tracking, revealing their potential and limitations in complex environments without extensive training.
Contribution
It systematically assesses the geometric awareness of foundation models like Stable Diffusion and DINOv2 for long-term correspondence tasks, including zero-shot and fine-tuning scenarios.
Findings
Features from Stable Diffusion and DINOv2 show the strongest geometric correspondence in zero-shot settings.
DINOv2 performs comparably to supervised models in adaptation settings.
Foundation models show promise as initialization for correspondence learning.
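To make the zero-shot finding concrete, here is a minimal sketch of one common way to track a point with frozen features: take the feature vector at the query point and, in every frame, pick the location with the highest cosine similarity. The function name `track_point_zero_shot`, the `(T, H, W, C)` feature layout, and the synthetic random features are assumptions for illustration only; the summary above does not specify the paper's exact matching protocol, and a real setup would extract the feature maps from a backbone such as DINOv2.

```python
import numpy as np


def track_point_zero_shot(query_feat, frame_feats):
    """Locate a query point in each frame by cosine-similarity argmax.

    query_feat:  (C,) feature vector sampled at the query point.
    frame_feats: (T, H, W, C) per-frame feature maps (hypothetical layout).
    Returns a (T, 2) integer array of (row, col) positions on the feature grid.
    """
    T, H, W, _ = frame_feats.shape
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    f = frame_feats / (np.linalg.norm(frame_feats, axis=-1, keepdims=True) + 1e-8)
    sim = f @ q  # (T, H, W) similarity map per frame
    flat_idx = sim.reshape(T, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_idx, (H, W))
    return np.stack([rows, cols], axis=1)


# Synthetic stand-in for backbone features: random maps with the query
# feature planted at a known, moving location in each frame.
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16
feats = rng.normal(size=(T, H, W, C))
query = rng.normal(size=C)
true_pos = [(1, 2), (2, 3), (3, 4), (4, 5)]
for t, (r, c) in enumerate(true_pos):
    feats[t, r, c] = query

tracks = track_point_zero_shot(query, feats)
print(tracks)
```

The predicted positions live on the feature grid, so in practice they would be scaled back to pixel coordinates by the backbone's stride. No parameters are trained here, which is what makes the setting zero-shot.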
Abstract
Large-scale vision foundation models have demonstrated remarkable success across various tasks, underscoring their robust generalization capabilities. While their proficiency in two-view correspondence has been explored, their effectiveness in long-term correspondence within complex environments remains unexplored. To address this, we evaluate the geometric awareness of visual foundation models in the context of point tracking: (i) in zero-shot settings, without any training; (ii) by probing with low-capacity layers; (iii) by fine-tuning with Low Rank Adaptation (LoRA). Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings. Furthermore, DINOv2 achieves performance comparable to supervised models in adaptation settings, demonstrating its potential as a strong initialization for correspondence…
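Setting (iii) in the abstract relies on Low Rank Adaptation (LoRA). As a rough illustration of the general technique, the sketch below wraps a frozen weight matrix with a trainable low-rank update, following the standard LoRA formulation. The class name `LoRALinear`, the rank, and the scaling value are illustrative assumptions; which layers the authors adapt and with what hyperparameters is not stated in this summary.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pretrained weight, kept frozen
        # A gets a small random init; B starts at zero so the adapted
        # layer initially reproduces the pretrained layer exactly.
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (N, in_dim) -> (N, out_dim)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        # Only A and B would receive gradient updates.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))  # stand-in for one pretrained projection
x = rng.normal(size=(5, 64))
layer = LoRALinear(W, rank=4)

print(np.allclose(layer(x), x @ W.T))   # update is zero at initialization
print(layer.num_trainable(), W.size)    # trainable vs. frozen parameter count
```

Because only `A` and `B` are trained, the number of updated parameters grows with the rank rather than with the full weight size, which is why LoRA is a natural fit for testing how far a pretrained backbone can go with light adaptation.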
Taxonomy
Topics: Data Visualization and Analytics
Methods: Diffusion
