TL;DR
This paper introduces STSANet, a video saliency prediction model that places spatio-temporal self-attention modules in a 3D convolutional backbone to capture long-range relations across time steps, and reports better results than prior methods on the DHF1K, Hollywood-2, UCF, and DIEM benchmarks.
Contribution
The paper proposes a 3D network that combines Spatio-Temporal Self-Attention (STSA) modules with an Attentional Multi-Scale Fusion (AMSF) module for video saliency prediction.
Findings
Outperforms state-of-the-art models on DHF1K, Hollywood-2, UCF, and DIEM datasets.
Effectively captures long-range spatio-temporal relations.
Demonstrates the importance of multi-level feature integration.
Abstract
3D convolutional neural networks have achieved promising results for video tasks in computer vision, including video saliency prediction, which is explored in this paper. However, 3D convolution encodes visual representations only within a fixed local spatio-temporal neighborhood determined by its kernel size, while human attention is always attracted by relational visual features at different times. To overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction, in which multiple Spatio-Temporal Self-Attention (STSA) modules are employed at different levels of a 3D convolutional backbone to directly capture long-range relations between spatio-temporal features of different time steps. In addition, we propose an Attentional Multi-Scale Fusion (AMSF) module to integrate multi-level features with the perception of context in semantic and spatio-temporal…
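To make the contrast with a fixed 3D kernel concrete, the following is a minimal NumPy sketch of generic scaled dot-product self-attention applied over all spatio-temporal positions of a feature volume. It is not the paper's STSA module, whose exact design is not given in the excerpt above. The function name, the tensor layout (T, H, W, C), the random projection matrices, and the residual connection are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatio_temporal_self_attention(feats, w_q, w_k, w_v):
    """Generic self-attention over a (T, H, W, C) feature volume.

    Hypothetical helper for illustration only; not the STSA module
    from the paper.
    """
    t, h, w, c = feats.shape
    # Flatten time and space into one token axis: every position in every
    # frame becomes a token that can attend to every other token.
    tokens = feats.reshape(t * h * w, c)
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    # Scaled dot-product attention across all T*H*W positions.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    out = attn @ v
    # Residual connection (an assumption), then restore the volume layout.
    return (tokens + out).reshape(t, h, w, c), attn


rng = np.random.default_rng(0)
T, H, W, C = 4, 3, 3, 8  # toy sizes, chosen arbitrarily
feats = rng.standard_normal((T, H, W, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

out, attn = spatio_temporal_self_attention(feats, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

In this sketch, each row of `attn` spans all T*H*W positions, so a location in the first frame receives non-zero weight from locations in the last frame. A single 3D convolution with a small kernel cannot relate those positions directly. The cost is an attention matrix that grows quadratically with the number of positions.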
Taxonomy
Methods: Convolution · 3D Convolution
