VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models
Jesimon Barreto, Carlos Caetano, André Araujo, William Robson Schwartz

TL;DR
VESSA introduces a self-supervised domain adaptation method for vision foundation models using multi-view object-centric videos, improving their robustness and performance without requiring annotations.
Contribution
The paper proposes a self-supervised fine-tuning approach that uses multi-view object-centric videos to adapt vision models to new domains without labels.
Findings
VESSA improves downstream classification accuracy across multiple models and datasets.
The method effectively learns robustness to varied capture conditions.
VESSA outperforms previous adaptation techniques in experiments.
Abstract
Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a…
