SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta

TL;DR
SAMWISE adds natural language understanding and temporal modeling to the SAM2 model, enabling streaming text-driven video segmentation without fine-tuning SAM2's weights, and achieves state-of-the-art results with minimal additional parameters.
Contribution
The paper introduces a novel adapter module for SAM2 that incorporates temporal and multi-modal cues, improving streaming segmentation in Referring Video Object Segmentation (RVOS) tasks.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Adds less than 5 million parameters to SAM2.
Effectively models temporal context without fine-tuning SAM2's weights.
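To give a feel for how an adapter can stay under a budget of 5 million added parameters, the sketch below counts the weights of a generic bottleneck adapter (down-projection, nonlinearity, up-projection). This is an illustration only: the adapter layout, the feature width, the bottleneck width, and the number of adapters are all hypothetical values chosen for the arithmetic, not figures taken from the paper.

```python
def adapter_param_count(d_model: int, bottleneck: int, bias: bool = True) -> int:
    """Parameters of one generic bottleneck adapter (assumed layout, not the paper's)."""
    down = d_model * bottleneck          # down-projection weight matrix
    up = bottleneck * d_model            # up-projection weight matrix
    biases = (bottleneck + d_model) if bias else 0
    return down + up + biases


def total_added_params(d_model: int, bottleneck: int, num_adapters: int) -> int:
    """Total parameters added when the same adapter is inserted num_adapters times."""
    return num_adapters * adapter_param_count(d_model, bottleneck)


# Hypothetical sizes, chosen only to make the arithmetic concrete.
D_MODEL = 1024
BOTTLENECK = 256
NUM_ADAPTERS = 8

total = total_added_params(D_MODEL, BOTTLENECK, NUM_ADAPTERS)
print(total, total < 5_000_000)
```

With these assumed sizes each adapter holds 525,568 parameters, so eight of them add about 4.2 million in total. The backbone's own weights are untouched in this setup, which is why the added parameter count is the only training cost that grows.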
Abstract
Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods either restrict reasoning to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights and without outsourcing modality interaction to external models. To…
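The streaming setting described in the abstract, where frames are processed one at a time while some context from past frames is retained, can be sketched as a loop over a bounded memory. This is a minimal illustration under stated assumptions: `stream_segment`, `segment_fn`, and `memory_size` are hypothetical names, the toy model stands in for a real network, and none of this is SAM2's or SAMWISE's actual API.

```python
from collections import deque
from typing import Any, Callable, Deque, Iterable, Iterator, List, Tuple

# Hypothetical signature: (frame, text expression, past features) -> (mask, feature)
SegmentFn = Callable[[Any, str, List[Any]], Tuple[Any, Any]]


def stream_segment(
    frames: Iterable[Any],
    expression: str,
    segment_fn: SegmentFn,
    memory_size: int = 4,
) -> Iterator[Any]:
    """Yield one mask per frame, keeping only the last memory_size features."""
    memory: Deque[Any] = deque(maxlen=memory_size)  # oldest entries drop automatically
    for frame in frames:
        mask, feature = segment_fn(frame, expression, list(memory))
        memory.append(feature)
        yield mask  # available immediately, without waiting for later frames


def toy_segment_fn(frame: int, expression: str, memory: List[int]) -> Tuple[int, int]:
    """Toy stand-in for a model: the 'mask' sums the frame with remembered features."""
    return frame + sum(memory), frame


masks = list(stream_segment([1, 2, 3, 4], "the red car", toy_segment_fn, memory_size=2))
print(masks)
```

Because the memory is bounded, the cost per frame stays constant regardless of video length, in contrast with offline methods that need the entire video before producing any mask.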
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
Methods: Adapter · Focus
