Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild

Haoran Wang; Zekun Li; Jian Zhang; Lei Qi; Yinghuan Shi

arXiv:2508.07759 · cs.CV · August 12, 2025

TL;DR

This paper introduces CAV-SAM, a lightweight test-time adaptation method that models reference-target image pairs as pseudo videos, enabling SAM2 to improve reference segmentation in the wild without extensive meta-training.

Contribution

It proposes a novel pseudo video perspective for reference segmentation, utilizing diffusion models and test-time fine-tuning to adapt SAM2 efficiently in real-world scenarios.

Findings

1. Achieved more than a 5% improvement in segmentation performance over state-of-the-art (SOTA) methods on benchmark datasets.

2. Introduced a diffusion-based semantic transition module for better semantic understanding.

3. Developed a test-time geometric alignment module for precise adaptation.

Abstract

Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and incurs massive data and computational costs. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term…
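
The pseudo-video idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes NumPy, a hypothetical helper named `build_pseudo_video`, and plain linear blending as a stand-in for the intermediate frames (the paper's own semantic transition module is diffusion-based and is not reproduced here).

```python
import numpy as np


def build_pseudo_video(reference, target, num_intermediate=0):
    """Stack a reference image and a target image into a short frame sequence.

    Hypothetical helper for illustration only. Frame 0 is the reference image,
    the last frame is the target image, and any intermediate frames are simple
    linear blends (an assumption, not the paper's diffusion-based transition).
    """
    if reference.shape != target.shape:
        raise ValueError("reference and target must share the same shape")

    ref = reference.astype(np.float32)
    tgt = target.astype(np.float32)

    # Blend weights run from 0 (pure reference) to 1 (pure target).
    weights = np.linspace(0.0, 1.0, num_intermediate + 2)
    frames = [(1.0 - w) * ref + w * tgt for w in weights]

    # Result has shape (T, H, W, C) with T = num_intermediate + 2.
    return np.stack(frames, axis=0)


# Toy inputs: a black reference image and a uniformly gray target image.
reference = np.zeros((4, 4, 3), dtype=np.uint8)
target = np.full((4, 4, 3), 200, dtype=np.uint8)

video = build_pseudo_video(reference, target, num_intermediate=3)
print(video.shape)
```

In a setup like this, the reference mask would presumably be given as the prompt on the first frame, and a video object segmentation model such as SAM2 would propagate it through the sequence to the last frame, which is the target image.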

Taxonomy

Topics: Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications