Unleashing the Power of Motion and Depth: A Selective Fusion Strategy for RGB-D Video Salient Object Detection

Jiahao He; Daerji Suolang; Keren Fu; Qijun Zhao

arXiv:2507.21857·cs.CV·July 30, 2025

TL;DR

This paper introduces SMFNet, an RGB-D video salient object detection (RGB-D VSOD) model that selectively fuses motion cues from optical flow with depth cues, weighting each according to its contribution in the scene rather than treating the two equally. The authors report that it outperforms 19 state-of-the-art models.

Contribution

The paper proposes a selective cross-modal fusion framework with pixel-level and multi-dimensional attention modules to better utilize optical flow and depth in RGB-D VSOD.
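
To make the idea of pixel-level selection concrete, here is a minimal sketch in plain Python. It is not the SMFNet module: the function name `pixelwise_select` and the per-pixel reliability scores are hypothetical, and in a real network the scores would be predicted by learned layers rather than supplied by hand. The sketch only shows how a per-pixel two-way softmax lets motion dominate at some pixels and depth at others, instead of weighting both equally everywhere.

```python
import math


def pixelwise_select(motion, depth, motion_score, depth_score):
    """Fuse two single-channel maps with per-pixel softmax weights.

    All arguments are 2-D lists of floats with the same shape.
    The score maps are hypothetical reliability logits (an assumption
    made for illustration, not taken from the paper).
    """
    fused = []
    for m_row, d_row, sm_row, sd_row in zip(motion, depth, motion_score, depth_score):
        out_row = []
        for m, d, sm, sd in zip(m_row, d_row, sm_row, sd_row):
            # Two-way softmax over (sm, sd), written as a sigmoid
            # of the score difference.
            w_motion = 1.0 / (1.0 + math.exp(sd - sm))
            out_row.append(w_motion * m + (1.0 - w_motion) * d)
        fused.append(out_row)
    return fused


# Toy 2x2 maps: motion is trusted at the top-left pixel, depth at the
# top-right pixel, and both are trusted equally on the bottom row.
motion = [[1.0, 0.0], [0.5, 0.2]]
depth = [[0.0, 1.0], [0.5, 0.8]]
motion_score = [[4.0, -4.0], [0.0, 0.0]]
depth_score = [[-4.0, 4.0], [0.0, 0.0]]

fused = pixelwise_select(motion, depth, motion_score, depth_score)
```

With equal scores the result reduces to a plain average of the two maps, which corresponds to the "treat both modalities equally" baseline that the paper argues against.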

Findings

01

SMFNet outperforms 19 state-of-the-art models on multiple datasets.

02

The selective fusion strategy effectively leverages motion and depth cues.

03

Comprehensive benchmarks validate the model's superiority.

Abstract

Applying salient object detection (SOD) to RGB-D videos is an emerging task called RGB-D VSOD. It has recently attracted increasing interest, both because incorporating motion and depth yields considerable performance gains and because RGB-D videos can now be captured easily in daily life. Existing RGB-D VSOD models derive motion cues in different ways, among which extracting motion information explicitly from optical flow appears to be the more effective and promising option. Despite this, a key issue remains: how to effectively use optical flow and depth to assist the RGB modality in SOD. Previous methods always treat optical flow and depth equally in their model designs, without explicitly considering their unequal contributions in individual scenarios, which limits the potential of motion and depth. To address this issue and unleash the power of motion and depth, we…
