Visual Saliency Transformer

Nian Liu; Ni Zhang; Kaiyuan Wan; Ling Shao; Junwei Han

arXiv:2104.12099·cs.CV·August 24, 2021

TL;DR

This paper introduces a novel pure transformer model called Visual Saliency Transformer (VST) for RGB and RGB-D salient object detection, leveraging global context modeling and multi-task learning to outperform existing CNN-based methods.

Contribution

The paper proposes a convolution-free transformer framework with multi-level token fusion, token upsampling, and a multi-task decoder for improved saliency and boundary detection.
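As a rough illustration of what "token upsampling" can mean in a convolution-free model, the sketch below rearranges each coarse token into several finer tokens, in the style of a pixel shuffle. This is an assumption made for illustration only: this page does not describe the paper's actual upsampling method, and the function name `upsample_tokens` and its parameters are hypothetical. In a real model, a learned linear projection would typically expand the token dimension before this rearrangement.

```python
def upsample_tokens(tokens, grid_h, grid_w, ratio):
    """Rearrange a row-major grid of tokens into a grid `ratio` times larger per side.

    Each input token of length C * ratio * ratio is split into ratio * ratio
    output tokens of length C (a pixel-shuffle-style rearrangement).
    Illustrative only; not the method from the paper.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    chunk_count = ratio * ratio
    if len(tokens[0]) % chunk_count != 0:
        raise ValueError("token length must be divisible by ratio * ratio")
    out_dim = len(tokens[0]) // chunk_count
    out_h, out_w = grid_h * ratio, grid_w * ratio
    out = [None] * (out_h * out_w)
    for row in range(grid_h):
        for col in range(grid_w):
            token = tokens[row * grid_w + col]
            for dy in range(ratio):
                for dx in range(ratio):
                    # Slice of the coarse token that becomes one fine token.
                    start = (dy * ratio + dx) * out_dim
                    # Row-major position of that fine token in the larger grid.
                    target = (row * ratio + dy) * out_w + (col * ratio + dx)
                    out[target] = token[start:start + out_dim]
    return out


# Hypothetical 1x2 grid of tokens, each holding four 1-dimensional sub-tokens.
coarse = [[1, 2, 3, 4], [5, 6, 7, 8]]
fine = upsample_tokens(coarse, grid_h=1, grid_w=2, ratio=2)
```

Here the two coarse tokens become a 2x4 grid of eight tokens, so the token sequence grows fourfold while each token shrinks by the same factor.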

Findings

01. Outperforms existing CNN-based methods on RGB and RGB-D benchmark datasets

02. Introduces a new transformer-based dense prediction paradigm

03. Provides high-resolution detection results

Abstract

Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary…
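The idea of taking image patches as inputs and propagating global context among them can be sketched in a few lines. The example below is a minimal, standard-library illustration and not the VST implementation: it assumes a tiny grayscale image, non-overlapping patches, and single-head attention with identity query/key/value projections (a real transformer learns these projections). The names `image_to_patch_tokens` and `self_attention` are hypothetical.

```python
import math


def image_to_patch_tokens(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            tokens.append([
                float(image[top + dy][left + dx])
                for dy in range(patch_size)
                for dx in range(patch_size)
            ])
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (an illustrative simplification)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Score this token against every token, including distant patches.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Output is a weighted mix of all tokens, i.e. global context.
        outputs.append([
            sum(w * value[i] for w, value in zip(row, tokens))
            for i in range(dim)
        ])
    return outputs, weights


# Hypothetical 4x4 grayscale image, split into four 2x2 patches.
image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 1.0, 0.9],
    [0.2, 0.1, 0.8, 0.9],
    [0.1, 0.2, 0.9, 0.8],
]
tokens = image_to_patch_tokens(image, patch_size=2)
outputs, weights = self_attention(tokens)
```

Every row of `weights` is strictly positive across all four patches, so each output token draws on the whole image in a single layer. A convolution, by contrast, only sees a local neighborhood per layer.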

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Layer Normalization · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Adam