TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection

Chang Liu; Gang Yang; Shuo Wang; Hangxu Wang; Yunhua Zhang; Yutao Wang

arXiv:2207.01172·cs.CV·July 5, 2022·1 cites

TL;DR

TANet is a Transformer-based asymmetric network that pairs a Transformer encoder for global semantic information from RGB data with a lightweight CNN encoder for spatial structure information from depth data, improving RGB-D salient object detection performance.

Contribution

The paper proposes an asymmetric hybrid encoder that pairs a Transformer (PVTv2) with a lightweight CNN (LWDepthNet), along with cross-modal fusion and edge enhancement modules, for RGB-D SOD.

Findings

01 Achieves state-of-the-art results on six datasets

02 Outperforms 14 existing RGB-D methods

03 Provides sharper object contours

Abstract

Existing RGB-D SOD methods mainly rely on a symmetric two-stream CNN-based network to extract RGB and depth channel features separately. However, the conventional symmetric network structure has two problems: first, CNNs have limited ability to learn global context; second, the symmetric two-stream structure ignores the inherent differences between modalities. In this paper, we propose a Transformer-based asymmetric network (TANet) to tackle these issues. We employ the powerful feature extraction capability of a Transformer (PVTv2) to extract global semantic information from RGB data and design a lightweight CNN backbone (LWDepthNet) to extract spatial structure information from depth data without pre-training. The asymmetric hybrid encoder (AHE) effectively reduces the number of parameters in the model while increasing speed without sacrificing…
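
The abstract's claim that an asymmetric encoder reduces the parameter count can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not TANet's architecture: it stands in plain 3x3 convolution stacks for both streams (the paper's RGB stream is PVTv2, a Transformer), and the channel widths and helper names (`conv_params`, `stream_params`) are hypothetical, chosen only to show why pairing a heavy RGB stream with a narrow depth stream costs fewer parameters than duplicating the heavy stream.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    """Number of learnable parameters in one k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def stream_params(in_channels, widths, k=3):
    """Total parameters of a plain stack of k x k convolutions."""
    total = 0
    c_in = in_channels
    for c_out in widths:
        total += conv_params(c_in, c_out, k)
        c_in = c_out
    return total


# Hypothetical channel widths, for illustration only (not taken from the paper).
HEAVY_WIDTHS = [64, 128, 320, 512]   # stand-in for a large RGB backbone
LIGHT_WIDTHS = [16, 32, 64, 128]     # stand-in for a lightweight depth backbone

rgb_heavy = stream_params(3, HEAVY_WIDTHS)    # RGB input has 3 channels
depth_heavy = stream_params(1, HEAVY_WIDTHS)  # depth input has 1 channel
depth_light = stream_params(1, LIGHT_WIDTHS)

# Symmetric design: the heavy backbone is duplicated for the depth stream.
symmetric_total = rgb_heavy + depth_heavy
# Asymmetric design: heavy RGB stream plus a narrow depth stream.
asymmetric_total = rgb_heavy + depth_light

if __name__ == "__main__":
    print("symmetric encoder parameters: ", symmetric_total)
    print("asymmetric encoder parameters:", asymmetric_total)
    print("ratio: %.2f" % (asymmetric_total / symmetric_total))
```

Under these assumed widths the depth stream shrinks by roughly a factor of twenty, so the encoder total is dominated by the RGB stream alone. The actual savings reported for TANet depend on the real PVTv2 and LWDepthNet configurations, which this sketch does not model.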

Code & Models

Repositories

lc012463/tanet (Official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Video Surveillance and Tracking Methods

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer