Temporal-Spatial Feature Pyramid for Video Saliency Detection
Qinyao Chang, Shiping Zhu

TL;DR
This paper introduces a 3D encoder-decoder architecture that effectively combines multi-scale, spatial, and temporal features for real-time video saliency detection, significantly outperforming existing methods.
Contribution
The paper proposes a novel 3D fully convolutional encoder-decoder model that integrates multi-level features with temporal information for improved video saliency detection.
Findings
Outperforms state-of-the-art methods on multiple benchmarks
Operates in real time with high accuracy
Effectively combines scale, space, and time features
Abstract
Multi-level features are important for saliency detection. Better combination and use of multi-level features with time information can greatly improve the accuracy of the video saliency model. In order to fully combine multi-level features and make them serve the video saliency model, we propose a 3D fully convolutional encoder-decoder architecture for video saliency detection, which combines scale, space and time information for video saliency modeling. The encoder extracts multi-scale temporal-spatial features from the input continuous video frames, and then constructs a temporal-spatial feature pyramid through temporal-spatial convolution and top-down feature integration. The decoder performs hierarchical decoding of temporal-spatial features from different scales, and finally produces a saliency map from the integration of multiple video frames. Our model is simple yet effective, and…
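The encoder, pyramid, and decoder stages described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary gives no layer details, so average pooling stands in for the learned 3D convolutional encoder, a plain element-wise sum stands in for top-down feature integration, and a mean over channels and frames stands in for the learned hierarchical decoder. All function names (avg_pool_spatial, upsample_spatial, build_pyramid, top_down_integrate, decode) and the clip shape are hypothetical and chosen only for illustration.

```python
import numpy as np

def avg_pool_spatial(x, k=2):
    """Average-pool H and W by a factor k; x has shape (C, T, H, W)."""
    c, t, h, w = x.shape
    return x.reshape(c, t, h // k, k, w // k, k).mean(axis=(3, 5))

def upsample_spatial(x, k=2):
    """Nearest-neighbour upsampling of H and W by a factor k."""
    return x.repeat(k, axis=2).repeat(k, axis=3)

def build_pyramid(clip, levels=3):
    """Stand-in encoder (assumption): features at progressively coarser scales."""
    feats = [clip]
    for _ in range(levels - 1):
        feats.append(avg_pool_spatial(feats[-1]))
    return feats  # finest level first

def top_down_integrate(feats):
    """Top-down pathway: add the upsampled coarser level into each finer level."""
    merged = [feats[-1]]  # start from the coarsest level
    for f in reversed(feats[:-1]):
        merged.append(f + upsample_spatial(merged[-1]))
    return merged[::-1]  # finest level first

def decode(pyramid):
    """Stand-in decoder (assumption): fuse all levels into one 2D map in (0, 1)."""
    h = pyramid[0].shape[2]
    total = np.zeros(pyramid[0].shape[2:])
    for p in pyramid:
        up = upsample_spatial(p, h // p.shape[2])  # bring level to full resolution
        total += up.mean(axis=(0, 1))              # collapse channels and frames
    total /= len(pyramid)
    return 1.0 / (1.0 + np.exp(-total))            # sigmoid squashing

# Hypothetical clip: 4 channels, 8 frames, 16x16 pixels.
clip = np.random.default_rng(0).standard_normal((4, 8, 16, 16))
pyramid = top_down_integrate(build_pyramid(clip))
saliency = decode(pyramid)
print([p.shape for p in pyramid], saliency.shape)
```

The point of the sketch is the data flow: several frames go in, each pyramid level keeps its own spatial scale while receiving context from the coarser levels above it, and a single saliency map for the clip comes out. In a trained model each stand-in step would be replaced by learned 3D convolutions.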
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Image and Video Quality Assessment
Methods: Convolution
