UniST: Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection

Junwen Xiong; Peng Zhang; Chuanyue Li; Wei Huang; Yufei Zha; Tao You

arXiv:2309.08220·cs.CV·September 18, 2023·2 cites

Open Access

TL;DR

UniST introduces a unified transformer framework that effectively models both video saliency prediction and salient object detection, achieving superior results across multiple benchmarks by leveraging multi-scale spatio-temporal features.

Contribution

This work is the first to design a transformer-based model that jointly addresses both video saliency prediction and salient object detection tasks.

Findings

01. Achieves superior performance on seven benchmarks.

02. Outperforms existing state-of-the-art methods.

03. Effectively models both tasks with a unified framework.

Abstract

Video saliency prediction and detection are thriving research domains that enable computers to simulate the distribution of visual attention, akin to how humans perceive dynamic scenes. While many approaches have crafted task-specific training paradigms for either video saliency prediction or video salient object detection, little attention has been devoted to devising a generalized saliency modeling framework that seamlessly bridges these two distinct tasks. In this study, we introduce the Unified Saliency Transformer (UniST) framework, which comprehensively utilizes the essential attributes of video saliency prediction and video salient object detection. In addition to extracting representations of frame sequences, a saliency-aware transformer is designed to learn the spatio-temporal representations at progressively increased resolutions, while incorporating effective cross-scale…

Taxonomy

Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment · Advanced Image Fusion Techniques

Methods: Attention Is All You Need · Softmax · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Linear Layer · Residual Connection · Adam · Multi-Head Attention · Layer Normalization
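
Several of the methods tagged above (Softmax, Multi-Head Attention) build on scaled dot-product attention. The following is a minimal, generic sketch of that operation in plain Python, included only to illustrate the building block. It is not UniST's implementation: the function names, the toy `tokens` input, and the single-head, unbatched form are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention: each query attends over all keys.

    queries, keys: lists of d-dimensional vectors.
    values: list of vectors, one per key.
    Returns (outputs, weights), one row per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, component by component.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy input: three tokens (for example, patches taken from
# consecutive frames), each with a 2-dimensional feature vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` is a probability distribution over the input tokens, so every output vector is a convex combination of the value vectors. A multi-head variant would run several such attentions on learned linear projections and concatenate the results.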