VST++: Efficient and Stronger Visual Saliency Transformer

Nian Liu; Ziyang Luo; Ni Zhang; Junwei Han

arXiv:2310.11725 · cs.CV · April 12, 2024 · 1 citation

Open Access

TL;DR

VST++ is an improved transformer-based model for salient object detection. It adds a new attention module, depth encoding, and token supervision to VST, and outperforms existing methods at reduced computational cost.

Contribution

The paper presents VST++, a more efficient and stronger version of VST. It adds a Select-Integrate Attention (SIA) module to reduce computational cost, along with depth encoding and token supervision, to improve salient object detection performance.

Findings

01. Outperforms existing methods on RGB, RGB-D, and RGB-T benchmarks.

02. Reduces computational costs by 25% compared to VST.

03. Demonstrates strong generalization and improved accuracy.

Abstract

While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e. VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning…
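The abstract mentions token upsampling inside a transformer but does not give the details of reverse T2T. As a rough illustration of the general idea only, the sketch below expands each token with a linear projection and places the resulting child tokens on a finer grid. It is a simplified, non-overlapping stand-in, not the paper's reverse T2T; the function name `upsample_tokens`, the shapes, and the random weights are all assumptions made for this example.

```python
import numpy as np


def upsample_tokens(tokens, grid_hw, scale, weight):
    """Upsample a (H*W, C) token sequence to (H*scale * W*scale, C_out).

    Hypothetical helper: each token is linearly expanded into scale*scale
    child tokens, which are placed in the scale x scale patch that the
    parent token covered on the finer grid.
    """
    h, w = grid_hw
    n, c = tokens.shape
    assert n == h * w, "token count must match the grid size"
    c_out = weight.shape[1] // (scale * scale)
    assert weight.shape == (c, scale * scale * c_out)

    expanded = tokens @ weight                        # (H*W, s*s*C_out)
    expanded = expanded.reshape(h, w, scale, scale, c_out)
    # Interleave patch rows/cols with grid rows/cols: (H, s, W, s, C_out).
    expanded = expanded.transpose(0, 2, 1, 3, 4)
    return expanded.reshape(h * scale * w * scale, c_out)


# Assumed sizes for illustration: a 14x14 grid of 64-dim tokens, upsampled 2x.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((14 * 14, 64))
weight = rng.standard_normal((64, 2 * 2 * 32)) * 0.02
up = upsample_tokens(tokens, (14, 14), 2, weight)
print(up.shape)  # (784, 32)
```

The output sequence has four times as many tokens as the input, so repeated application would grow a coarse token grid toward the resolution needed for a dense saliency map.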

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image Fusion Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Residual Connection · Absolute Position Encodings · Layer Normalization · Softmax · Adam · Byte Pair Encoding