VST++: Efficient and Stronger Visual Saliency Transformer
Nian Liu, Ziyang Luo, Ni Zhang, Junwei Han

TL;DR
VST++ is an improved transformer-based model for salient object detection. It extends VST with a Select-Integrate Attention (SIA) module, depth encoding, and token supervision, and it outperforms existing methods at lower computational cost.
Contribution
The paper presents VST++, a more efficient and stronger version of VST. It introduces the Select-Integrate Attention (SIA) module to reduce computational cost, along with depth encoding and token supervision to improve salient object detection performance.
Findings
Outperforms existing methods on RGB, RGB-D, and RGB-T benchmarks.
Reduces computational costs by 25% compared to VST.
Demonstrates strong generalization and improved accuracy.
Abstract
While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e. VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image Fusion Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Residual Connection · Absolute Position Encodings · Layer Normalization · Softmax · Adam · Byte Pair Encoding
