M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection
Jiyuan Liu, Jia Lin, Xiaofei Zhou, Runmin Cong, Deyang Liu, Zhi Liu

TL;DR
This paper introduces M$^4$-SAM, a novel multi-modal mixture-of-experts model with memory augmentation, enhancing RGB-D video salient object detection by addressing SAM2's limitations in spatial modeling, multi-scale feature utilization, and prompt dependence.
Contribution
It proposes a new framework combining modality-aware PEFT, hierarchical feature fusion, and prompt-free memory initialization for improved RGB-D VSOD performance.
Findings
Achieves state-of-the-art results on three RGB-D VSOD datasets.
Effectively balances spatial details and semantic context.
Enables zero-shot VSOD without manual prompts.
Abstract
The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges: the limited spatial modeling of linear LoRA, insufficient use of SAM2's multi-scale features, and the dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. First, we inject Modality-Aware MoE-LoRA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into…
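As a rough illustration of the idea behind the Modality-Aware MoE-LoRA described above, the following NumPy sketch adds a low-rank update to a frozen projection, passes the rank-`r` feature map through several 3x3 depthwise convolutional experts, and mixes them with per-modality gate weights. It is not the authors' implementation: the class name `ModalityMoELoRA`, the depthwise 3x3 expert design, the per-modality gate logits standing in for the "modality dispatcher", and all shapes and initializations are assumptions made for illustration only.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def depthwise_conv3x3(x, k):
    """Zero-padded 'same' depthwise 3x3 convolution.

    x: (H, W, r) feature map, k: (3, 3, r) one kernel per channel.
    """
    H, W, _ = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += p[dy:dy + H, dx:dx + W, :] * k[dy, dx, :]
    return out


class ModalityMoELoRA:
    """Hypothetical sketch: frozen projection plus a LoRA branch whose
    bottleneck is processed by convolutional experts (assumed design)."""

    def __init__(self, c_in, c_out, rank=4, n_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, (c_in, c_out))   # frozen backbone weight
        self.A = rng.normal(0, 0.02, (c_in, rank))    # LoRA down-projection
        self.B = np.zeros((rank, c_out))              # LoRA up-projection, zero init
        # One depthwise 3x3 kernel per expert (assumption: experts are convs).
        self.kernels = rng.normal(0, 0.1, (n_experts, 3, 3, rank))
        # Assumed dispatcher: one learnable logit vector per modality.
        self.gate_logits = {
            "rgb": rng.normal(size=n_experts),
            "depth": rng.normal(size=n_experts),
        }

    def gate(self, modality):
        """Expert mixing weights for the given modality (sum to 1)."""
        return softmax(self.gate_logits[modality])

    def __call__(self, x, modality):
        """x: (H, W, c_in) feature map; modality: 'rgb' or 'depth'."""
        base = x @ self.W                      # frozen path
        low = x @ self.A                       # (H, W, rank) bottleneck
        weights = self.gate(modality)
        mixed = sum(w * depthwise_conv3x3(low, k)
                    for w, k in zip(weights, self.kernels))
        return base + mixed @ self.B           # low-rank, spatially aware update


layer = ModalityMoELoRA(c_in=8, c_out=8)
x = np.random.default_rng(1).normal(size=(5, 6, 8))
y_rgb = layer(x, "rgb")
y_depth = layer(x, "depth")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen projection, which is the usual LoRA initialization; once `B` is trained, the RGB and depth streams receive different mixtures of the same convolutional experts.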
