Transformer-based Network for RGB-D Saliency Detection
Yue Wang, Xu Jia, Lu Zhang, Yuke Li, James Elder, Huchuan Lu

TL;DR
This paper introduces a transformer-based network for RGB-D saliency detection that effectively captures long-range dependencies and fuses multi-scale, multi-modal features, outperforming existing methods on benchmark datasets.
Contribution
The paper proposes a novel transformer-based architecture with modules for feature enhancement and fusion, simplifying design and improving performance in RGB-D saliency detection.
Findings
Outperforms state-of-the-art methods on six benchmark datasets
Effectively captures long-range dependencies in feature fusion
Simplifies model design using transformer operations
Abstract
RGB-D saliency detection integrates information from both RGB images and depth maps to improve prediction of salient regions under challenging conditions. The key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities. Previous approaches tend to apply multi-scale and multi-modal fusion separately via local operations, which fail to capture long-range dependencies. Here we propose a transformer-based network to address this issue. Our proposed architecture is composed of two modules: a transformer-based within-modality feature enhancement module (TWFEM) and a transformer-based feature fusion module (TFFM). TFFM performs thorough feature fusion by integrating features from multiple scales and both modalities over all positions simultaneously. TWFEM enhances features at each scale by selecting and integrating complementary…
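To make the idea of fusing "multiple scales and two modalities over all positions simultaneously" concrete, here is a minimal NumPy sketch. It is not the authors' TFFM. It assumes plain single-head scaled dot-product self-attention, random projection weights, and no positional or modality embeddings. The names (flatten_to_tokens, joint_self_attention) and the feature-map sizes are hypothetical, chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def flatten_to_tokens(feature_map):
    """Turn a (C, H, W) feature map into (H*W, C) tokens, one per position."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T


def joint_self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens at once."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
channels = 8
scales = [(8, 8), (4, 4), (2, 2)]  # hypothetical spatial sizes per scale

# Stand-ins for backbone features: one map per scale for each modality.
rgb_feats = [rng.standard_normal((channels, h, w)) for h, w in scales]
depth_feats = [rng.standard_normal((channels, h, w)) for h, w in scales]

# Concatenate tokens from every scale and both modalities into one sequence,
# so a single attention step relates any position to any other position.
tokens = np.concatenate(
    [flatten_to_tokens(f) for f in rgb_feats + depth_feats], axis=0
)

# Random projections stand in for learned query/key/value weights.
w_q, w_k, w_v = (
    rng.standard_normal((channels, channels)) / np.sqrt(channels)
    for _ in range(3)
)

fused, weights = joint_self_attention(tokens, w_q, w_k, w_v)
print(tokens.shape, fused.shape, weights.shape)
```

Because every token's attention row spans the whole concatenated sequence, a position in the coarsest depth map can directly influence a position in the finest RGB map. Local operations such as small convolutions cannot form that kind of long-range link in one step.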
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
