MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

Fu Rong; Meng Lan; Qian Zhang; Lefei Zhang

arXiv:2501.13667·cs.CV·August 11, 2025

Open Access · 1 Repo

TL;DR

This paper introduces MPG-SAM 2, a novel framework that enhances referring video object segmentation by integrating multimodal encoding, mask priors, and global context aggregation to improve accuracy and temporal consistency.

Contribution

It proposes a unified multimodal encoder, a mask prior generator, and a hierarchical global-historical aggregator to adapt SAM 2 for offline RVOS, addressing two obstacles: translating text into effective prompts and SAM 2's lack of global context awareness.

Findings

01. Outperforms existing RVOS methods on multiple benchmarks.

02. Effectively integrates multimodal and global context information.

03. Improves temporal consistency and segmentation accuracy.

Abstract

Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These…
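The abstract says the mask prior generator uses video embeddings and class tokens to create pseudo masks, but the visible text does not say how. As a rough illustration only, one common way to turn a token and a grid of embeddings into a coarse mask is a scaled dot-product similarity followed by a sigmoid and a threshold. The sketch below assumes exactly that; the function name `pseudo_mask`, the `threshold` parameter, and the scaling are hypothetical and are not taken from the paper or its repository.

```python
import math


def pseudo_mask(video_emb, class_token, threshold=0.5):
    """Illustrative pseudo mask for a single frame (assumed scheme, not the paper's).

    video_emb:   H x W x C nested lists, one C-dim embedding per spatial location.
    class_token: length-C list standing in for a multimodal class token.
    Returns an H x W grid of 0/1 values.
    """
    # Scale by 1/sqrt(C) so scores stay in a reasonable range (an assumption).
    scale = 1.0 / math.sqrt(len(class_token))
    mask = []
    for row in video_emb:
        mask_row = []
        for pixel in row:
            # Similarity between this location's embedding and the class token.
            score = scale * sum(p * c for p, c in zip(pixel, class_token))
            # Sigmoid turns the score into a foreground probability.
            prob = 1.0 / (1.0 + math.exp(-score))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


# Toy 2 x 2 frame with 2-dim embeddings.
frame = [
    [[4.0, 0.0], [-4.0, 0.0]],
    [[0.0, 3.0], [2.0, 1.0]],
]
token = [1.0, 0.0]
print(pseudo_mask(frame, token))
```

Locations whose embeddings align with the token come out as foreground. In an approach like the one the abstract outlines, a coarse mask of this kind would serve as a prompt for SAM 2 rather than as the final segmentation.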

Code & Models

Repositories

rongfu-dsb/MPG-SAM2
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Segment Anything Model