Depth-Cooperated Trimodal Network for Video Salient Object Detection
Yukang Lu, Dingyao Min, Keren Fu, Qijun Zhao

TL;DR
This paper introduces DCTNet, a depth-cooperated trimodal network for video salient object detection (VSOD). It combines RGB frames, optical flow, and depth generated from the RGB frames, using a multi-modal attention module and a refinement fusion module to improve detection accuracy.
Contribution
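The paper is a pioneering effort to incorporate depth information into video salient object detection. RGB is treated as the main modality, with depth and optical flow as auxiliary modalities. A multi-modal attention module (MAM) models long-range dependencies between the main and auxiliary modalities, and a refinement fusion module (RFM) suppresses noise in each modality and selects useful information.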
Findings
DCTNet outperforms 12 state-of-the-art methods on five benchmark datasets.
Experiments also validate the effectiveness and necessity of depth information for video salient object detection.
Abstract
Depth can provide useful geometric cues for salient object detection (SOD), and has been proven helpful in recent RGB-D SOD methods. However, existing video salient object detection (VSOD) methods only utilize spatiotemporal information and seldom exploit depth information for detection. In this paper, we propose a depth-cooperated trimodal network for VSOD, called DCTNet, which is a pioneering work to incorporate depth information to assist VSOD. To this end, we first generate depth from RGB frames, and then propose an approach to treat the three modalities unequally. Specifically, a multi-modal attention module (MAM) is designed to model multi-modal long-range dependencies between the main modality (RGB) and the two auxiliary modalities (depth, optical flow). We also introduce a refinement fusion module (RFM) to suppress noise in each modality and select useful information…
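The idea of treating RGB as the main modality and letting it gather context from depth and optical flow can be illustrated with a generic cross-attention sketch. This is not the authors' MAM: the abstract gives no architectural details, so the scaled dot-product form, the residual sum, and the names `cross_attend` and `fuse_trimodal` are all assumptions made for illustration. Features are plain lists of vectors (one vector per spatial position) so that the example runs with the standard library alone; a real model would use a tensor library and learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(main, aux):
    """Let every main-modality position attend over all auxiliary positions.

    main: list of N feature vectors (hypothetical RGB features, used as queries)
    aux:  list of M feature vectors of the same dimension (depth or flow,
          used as both keys and values)
    Returns N vectors, each a weighted average of the auxiliary vectors.
    """
    dim = len(main[0])
    scale = math.sqrt(dim)
    attended = []
    for query in main:
        weights = softmax([dot(query, key) / scale for key in aux])
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, aux)) for d in range(dim)]
        )
    return attended


def fuse_trimodal(rgb, depth, flow):
    """Assumed residual fusion: RGB stays the main stream, and depth and
    flow only add context gathered through cross-attention."""
    from_depth = cross_attend(rgb, depth)
    from_flow = cross_attend(rgb, flow)
    return [
        [r + d + f for r, d, f in zip(rgb_vec, depth_vec, flow_vec)]
        for rgb_vec, depth_vec, flow_vec in zip(rgb, from_depth, from_flow)
    ]


# Toy features: 2 RGB positions, 3 depth positions, 3 flow positions, dimension 2.
rgb = [[1.0, 0.0], [0.0, 1.0]]
depth = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
flow = [[0.2, 0.1], [0.3, 0.4], [0.0, 0.0]]
fused = fuse_trimodal(rgb, depth, flow)
```

Because the attention weights for each RGB position sum to one over every auxiliary position, each output can draw on the whole auxiliary feature map, which is one common reading of "long-range dependencies". The asymmetry (RGB supplies the queries, the auxiliaries supply keys and values) is one simple way to treat the modalities unequally.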
Taxonomy
Topics: Visual Attention and Saliency Detection · Virtual Reality Applications and Impacts · Face Recognition and Perception
