M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

Jiyuan Liu; Jia Lin; Xiaofei Zhou; Runmin Cong; Deyang Liu; Zhi Liu

arXiv:2605.11760·cs.CV·May 13, 2026

TL;DR

This paper introduces M$^4$-SAM, a multi-modal mixture-of-experts framework with memory augmentation that adapts SAM2 to RGB-D video salient object detection. It targets three limitations of SAM2 in this setting: the limited spatial modeling of linear LoRA, insufficient use of multi-scale features, and dependence on explicit prompts for initialization.

Contribution

It proposes a framework that combines modality-aware parameter-efficient fine-tuning (PEFT), hierarchical feature fusion, and prompt-free memory initialization for RGB-D video salient object detection (RGB-D VSOD).

Findings

01. Achieves state-of-the-art results on three RGB-D VSOD datasets.

02. Effectively balances spatial details and semantic context.

03. Enables zero-shot VSOD without manual prompts.

Abstract

The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges: the limited spatial modeling of linear LoRA, insufficient use of SAM's multi-scale features, and the dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. First, we inject Modality-Aware MoE-LoRA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into…
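To make the idea of a LoRA-style adapter with convolutional experts and a modality dispatcher more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `ModalityMoELoRA`, the depthwise 3x3 experts, the per-modality softmax gate, and every hyperparameter are assumptions chosen for illustration, since the abstract shown here is truncated and does not give those details.

```python
import numpy as np


def depthwise_conv3x3(x, kernels):
    """Zero-padded, stride-1 depthwise 3x3 convolution.

    x: (r, H, W) feature map; kernels: (r, 3, 3), one kernel per channel.
    """
    r, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


class ModalityMoELoRA:
    """Hypothetical adapter: low-rank down/up projections with a mixture of
    convolutional experts in the low-rank space, gated per modality."""

    def __init__(self, channels, rank=4, num_experts=3, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, (rank, channels))  # LoRA "A"
        self.up = np.zeros((channels, rank))  # LoRA "B", zero-initialised
        self.experts = rng.normal(0.0, 0.02, (num_experts, rank, 3, 3))
        # Assumed dispatcher: one learnable logit vector per modality.
        self.gate_logits = {
            "rgb": np.zeros(num_experts),
            "depth": np.zeros(num_experts),
        }
        self.scale = scale

    def __call__(self, x, modality):
        """x: (C, H, W) features from a frozen layer; returns the same shape."""
        gates = softmax(self.gate_logits[modality])
        z = np.einsum("rc,chw->rhw", self.down, x)  # project to rank r
        mixed = sum(g * depthwise_conv3x3(z, k) for g, k in zip(gates, self.experts))
        delta = np.einsum("cr,rhw->chw", self.up, mixed)  # project back to C
        return x + self.scale * delta  # residual update on frozen features


features = np.random.default_rng(1).normal(size=(8, 5, 6))
adapter = ModalityMoELoRA(channels=8)
adapted = adapter(features, "depth")
```

Because the up-projection starts at zero, as in standard LoRA, the adapter initially leaves the frozen features unchanged. The convolution in the low-rank space is what lets each expert see a local neighbourhood, which a purely linear LoRA update applied per position cannot do.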
