TL;DR
This paper introduces a novel pure transformer model called Visual Saliency Transformer (VST) for RGB and RGB-D salient object detection, leveraging global context modeling and multi-task learning to outperform existing CNN-based methods.
Contribution
The paper proposes a convolution-free transformer framework with multi-level token fusion, token upsampling, and a multi-task decoder for improved saliency and boundary detection.
Findings
Outperforms existing CNN-based methods on RGB and RGB-D salient object detection benchmark datasets
Introduces a new transformer-based dense prediction paradigm
Provides high-resolution detection results
Abstract
Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary…
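To make the "image patches as inputs" and "propagate global contexts" ideas concrete, here is a minimal pure-Python sketch. It is not the VST implementation: the non-overlapping patch split, the identity query/key/value projections, the single attention head, and the helper names `image_to_patch_tokens` and `self_attention` are all assumptions chosen for illustration.

```python
import math


def image_to_patch_tokens(image, patch):
    """Split an H x W grayscale image (list of rows) into non-overlapping
    patch x patch tiles and flatten each tile into one token vector.
    Assumption: non-overlapping tiles; the paper's tokenizer may differ."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch)
                           for dx in range(patch)])
    return tokens


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.
    Assumption: identity Q/K/V projections (a trained model learns these)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token, including distant ones.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output token is a weighted average of all tokens.
        out.append([sum(wt * v[i] for wt, v in zip(weights, tokens))
                    for i in range(d)])
    return out


# Toy 4x4 "image" split into four 2x2 patches.
image = [[float((r * 4 + c) % 3) for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, patch=2)
mixed = self_attention(tokens)
```

Because every output token is a softmax-weighted average over all patch tokens, a patch in one corner can draw on information from the opposite corner in a single layer, whereas a convolution only sees its local receptive field.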
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Layer Normalization · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Adam
