TL;DR
SwinNet leverages Swin Transformer and edge-guided fusion to enhance salient object detection in RGB-D and RGB-T images, outperforming existing models by effectively capturing hierarchical features and boundary details.
Contribution
The paper introduces a novel cross-modality fusion model using Swin Transformer and edge guidance for improved RGB-D and RGB-T salient object detection.
Findings
Outperforms state-of-the-art models on multiple datasets.
Effectively captures hierarchical and boundary features.
Enhances cross-modality feature fusion.
Abstract
Convolutional neural networks (CNNs) are good at extracting contextual features within certain receptive fields, while transformers can model global long-range dependencies. By combining the advantages of transformers and CNNs, Swin Transformer shows strong feature representation ability. Based on it, we propose a cross-modality fusion model, SwinNet, for RGB-D and RGB-T salient object detection. It is driven by Swin Transformer to extract hierarchical features, boosted by an attention mechanism to bridge the gap between the two modalities, and guided by edge information to sharpen the contours of salient objects. Specifically, a two-stream Swin Transformer encoder first extracts multi-modality features, and then a spatial alignment and channel re-calibration module optimizes intra-level cross-modality features. To clarify the fuzzy boundary, an edge-guided decoder…
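The abstract names a spatial alignment and channel re-calibration step but does not give its equations, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes both modalities yield same-shaped (C, H, W) feature maps, that spatial alignment is a single attention map derived from the element-wise product of the two modalities, and that channel re-calibration is a per-channel gate from global max pooling. The function name `fuse_modalities`, the pooling choices, and the absence of learned weights are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here to squash attention scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def fuse_modalities(feat_rgb, feat_aux):
    """Hypothetical fusion of two same-shaped (C, H, W) feature maps.

    feat_rgb: features from the RGB stream.
    feat_aux: features from the depth or thermal stream.
    """
    if feat_rgb.shape != feat_aux.shape:
        raise ValueError("both modalities must have the same (C, H, W) shape")

    # Spatial alignment (assumed form): one shared (1, H, W) attention map
    # built from locations where both modalities respond together.
    common = feat_rgb * feat_aux
    spatial_att = sigmoid(common.max(axis=0, keepdims=True))
    rgb_aligned = feat_rgb * spatial_att
    aux_aligned = feat_aux * spatial_att

    # Channel re-calibration (assumed form): a (C, 1, 1) gate per modality,
    # computed from global max pooling over the spatial dimensions.
    rgb_gate = sigmoid(rgb_aligned.max(axis=(1, 2), keepdims=True))
    aux_gate = sigmoid(aux_aligned.max(axis=(1, 2), keepdims=True))

    # Combine the re-weighted modalities into one fused feature map.
    return rgb_aligned * rgb_gate + aux_aligned * aux_gate


rng = np.random.default_rng(0)
feat_rgb = rng.standard_normal((8, 4, 4))
feat_aux = rng.standard_normal((8, 4, 4))
fused = fuse_modalities(feat_rgb, feat_aux)
print(fused.shape)  # (8, 4, 4)
```

In a real model the attention maps would typically come from learned convolutions or linear layers, and the module would be applied at each level of the two-stream encoder; the fixed pooling used here only illustrates the data flow.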
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Stochastic Depth · Softmax · Label Smoothing
