TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection
Chang Liu, Gang Yang, Shuo Wang, Hangxu Wang, Yunhua Zhang, Yutao, Wang

TL;DR
TANet introduces a transformer-based asymmetric network that effectively combines global semantic and spatial features from RGB and depth data, improving salient object detection performance.
Contribution
The paper proposes a novel asymmetric hybrid encoder using Transformer and lightweight CNN, along with a cross-modal fusion and edge enhancement modules for superior RGB-D SOD.
Findings
Achieves state-of-the-art results on six datasets
Outperforms 14 existing RGB-D methods
Provides sharper object contours
Abstract
Existing RGB-D SOD methods mainly rely on a symmetric two-stream CNN-based network to extract RGB and depth channel features separately. However, there are two problems with the symmetric conventional network structure: first, the ability of CNN in learning global contexts is limited; second, the symmetric two-stream structure ignores the inherent differences between modalities. In this paper, we propose a Transformer-based asymmetric network (TANet) to tackle the issues mentioned above. We employ the powerful feature extraction capability of Transformer (PVTv2) to extract global semantic information from RGB data and design a lightweight CNN backbone (LWDepthNet) to extract spatial structure information from depth data without pre-training. The asymmetric hybrid encoder (AHE) effectively reduces the number of parameters in the model while increasing speed without sacrificing…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVisual Attention and Saliency Detection · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
MethodsAttention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer
