SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation

Claudia Cuttano; Gabriele Trivigno; Gabriele Rosi; Carlo Masone; Giuseppe Averta

arXiv:2411.17646 · cs.CV · March 26, 2025

TL;DR

SAMWISE augments the SAM2 model with natural language understanding and temporal modeling, enabling streaming text-driven video segmentation without fine-tuning SAM2's weights, and achieves state-of-the-art results while adding fewer than 5 million parameters.

Contribution

The paper introduces a novel adapter module for SAM2 that injects temporal and multi-modal cues at the feature extraction stage, improving streaming Referring Video Object Segmentation (RVOS).

Findings

01. Achieves state-of-the-art performance on multiple benchmarks.

02. Adds fewer than 5 million parameters to SAM2.

03. Models temporal context without fine-tuning SAM2's weights.
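
To give a rough sense of why an adapter can stay within a small parameter budget while the backbone remains frozen, the sketch below counts the weights of a generic bottleneck adapter (a down-projection followed by an up-projection). This is not SAMWISE's actual module: the function names, the feature width of 256, the bottleneck width of 64, and the count of 12 adapted layers are all hypothetical values chosen only for illustration.

```python
def bottleneck_adapter_params(d_model: int, d_bottleneck: int, bias: bool = True) -> int:
    """Count trainable parameters of one generic bottleneck adapter.

    The adapter is assumed to be two linear layers:
    d_model -> d_bottleneck (down) and d_bottleneck -> d_model (up).
    """
    down = d_model * d_bottleneck + (d_bottleneck if bias else 0)
    up = d_bottleneck * d_model + (d_model if bias else 0)
    return down + up


def total_adapter_params(d_model: int, d_bottleneck: int, num_layers: int) -> int:
    """Total trainable parameters when one adapter is attached to each layer."""
    return num_layers * bottleneck_adapter_params(d_model, d_bottleneck)


# Hypothetical sizes, not taken from the paper.
per_adapter = bottleneck_adapter_params(d_model=256, d_bottleneck=64)
total = total_adapter_params(d_model=256, d_bottleneck=64, num_layers=12)
print(per_adapter, total)
```

Under these assumed sizes, each adapter has tens of thousands of weights, so even a dozen of them stay well below the 5 million figure reported above. The real count depends on the adapter design described in the paper.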

Abstract

Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods either restrict reasoning to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights and without outsourcing modality interaction to external models. To…

Code & Models

Repositories

claudiacuttano/samwise (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis

Methods: Adapter · Focus