MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang

TL;DR
This paper introduces MPG-SAM 2, a novel framework that enhances referring video object segmentation by integrating multimodal encoding, mask priors, and global context aggregation to improve accuracy and temporal consistency.
Contribution
It proposes a unified multimodal encoder, mask prior generator, and hierarchical global-historical aggregator to adapt SAM 2 for offline RVOS tasks, addressing prompt translation and global context issues.
Findings
Outperforms existing RVOS methods on multiple benchmarks.
Effectively integrates multimodal and global context information.
Improves temporal consistency and segmentation accuracy.
Abstract
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These…
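To make the mask-prior idea concrete, here is a minimal sketch of one way a pseudo mask could be derived from per-pixel video embeddings and a class token. It is not the paper's mask prior generator, whose internals are not described in the text above. The function name `mask_prior`, the tensor layout, and the choice of scaled dot-product similarity followed by a sigmoid are all assumptions made for illustration.

```python
import math
import random


def mask_prior(video_emb, cls_token):
    """Hypothetical mask prior: similarity between each pixel and a class token.

    video_emb: nested list indexed [T][H][W][C] (frames, height, width, channels).
    cls_token: list of length C.
    Returns a nested list [T][H][W] of soft mask values in (0, 1).
    """
    scale = math.sqrt(len(cls_token))

    def score(pixel):
        # Assumed scoring rule: scaled dot product, squashed by a sigmoid.
        logit = sum(p * c for p, c in zip(pixel, cls_token)) / scale
        return 1.0 / (1.0 + math.exp(-logit))

    return [[[score(px) for px in row] for row in frame] for frame in video_emb]


# Toy input: 2 frames of 3x3 pixels with 4-dimensional embeddings.
random.seed(0)
T, H, W, C = 2, 3, 3, 4
video_emb = [[[[random.gauss(0, 1) for _ in range(C)] for _ in range(W)]
              for _ in range(H)] for _ in range(T)]
cls_token = [random.gauss(0, 1) for _ in range(C)]

prior = mask_prior(video_emb, cls_token)
print(len(prior), len(prior[0]), len(prior[0][0]))
```

In a SAM 2 style pipeline, a soft mask like this could be passed to the model as a dense mask prompt in place of points or boxes, which is the kind of text-to-prompt translation the abstract refers to.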
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Segment Anything Model
