Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
Haoran Wang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi

TL;DR
This paper introduces CAV-SAM, a lightweight test-time adaptation method that models reference-target image pairs as pseudo videos, enabling SAM2 to improve reference segmentation in the wild without extensive meta-training.
Contribution
It proposes a novel pseudo video perspective for reference segmentation, utilizing diffusion models and test-time fine-tuning to adapt SAM2 efficiently in real-world scenarios.
Findings
Improves on state-of-the-art methods by more than 5% on benchmark datasets.
Introduces a diffusion-based semantic transition module for better semantic understanding.
Develops a test-time geometric alignment module for precise adaptation.
Abstract
Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and brings massive data and computational cost. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term…
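The pseudo-video idea in the abstract can be illustrated with a toy sketch: a reference image and a target image are arranged as the first and last frames of a short sequence, so that a video segmentation model could carry the reference mask forward to the target. Everything here is an assumption made for illustration. The function name build_pseudo_video, the num_intermediate parameter, and the linear cross-fade are hypothetical; the cross-fade is only a simple stand-in for the paper's diffusion-based semantic transition, and no SAM2 call is shown.

```python
import numpy as np


def build_pseudo_video(reference, target, num_intermediate=3):
    """Arrange a reference/target image pair as a short frame sequence.

    Hypothetical illustration only: the intermediate frames are a linear
    cross-fade, used here as a placeholder for the diffusion-based
    semantic transition described in the paper.
    """
    if reference.shape != target.shape:
        raise ValueError("reference and target must have the same shape")
    ref = reference.astype(np.float32)
    tgt = target.astype(np.float32)
    # Blend weights: 0.0 yields the reference frame, 1.0 yields the target.
    weights = np.linspace(0.0, 1.0, num_intermediate + 2)
    return [((1.0 - w) * ref + w * tgt).astype(reference.dtype) for w in weights]


# Toy inputs: a dark reference image, a brighter target, and a reference mask.
reference = np.zeros((4, 4, 3), dtype=np.uint8)
target = np.full((4, 4, 3), 200, dtype=np.uint8)
reference_mask = np.zeros((4, 4), dtype=bool)
reference_mask[1:3, 1:3] = True

frames = build_pseudo_video(reference, target, num_intermediate=3)
# In this framing, frames[0] together with reference_mask would act as the
# annotated first frame, and frames[-1] is the frame to be segmented.
```

In this framing the reference mask plays the role of the first-frame prompt in interactive video object segmentation, and the prediction on the final frame is the segmentation of the target image.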
Taxonomy
Topics: Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
