Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues
Rohit Girmaji, Siddharth Jain, Bhav Beri, Sarthak Bansal, Vineet Gandhi

TL;DR
This paper presents ViNet variants for efficient video saliency prediction, with ViNet-S being lightweight and ViNet-A incorporating spatio-temporal cues, achieving state-of-the-art results with high speed and low resource usage.
Contribution
Introduces ViNet-S and ViNet-A models that improve efficiency and incorporate spatio-temporal cues for superior video saliency prediction.
Findings
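ViNet-S runs at over 1000 fps.
An ensemble of ViNet-S and ViNet-A, formed by averaging their predicted saliency maps, achieves state-of-the-art results on three visual-only and six audio-visual saliency datasets.
The models outperform transformer-based approaches in both parameter efficiency and real-time performance.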
Abstract
This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, by averaging predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000fps.
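The ensembling step described in the abstract, averaging the saliency maps predicted by the two models, can be illustrated with a minimal sketch. This is not the authors' code: the function name `ensemble_average` and the toy 2x2 maps are hypothetical, and plain Python lists stand in for the arrays or tensors a real pipeline would use. It also assumes both maps share the same resolution and value range, which the abstract does not specify.

```python
def ensemble_average(map_s, map_a):
    """Pixel-wise mean of two same-sized 2D saliency maps (lists of rows).

    map_s and map_a stand for the outputs of ViNet-S and ViNet-A for one
    frame; the names are illustrative, not taken from the paper's code.
    """
    same_rows = len(map_s) == len(map_a)
    if not same_rows or any(len(r1) != len(r2) for r1, r2 in zip(map_s, map_a)):
        raise ValueError("saliency maps must have the same shape")
    # Average corresponding pixels of the two predictions.
    return [[(x + y) / 2.0 for x, y in zip(r1, r2)]
            for r1, r2 in zip(map_s, map_a)]


# Toy example: two hypothetical 2x2 predictions for the same frame.
pred_s = [[0.0, 0.2], [0.4, 1.0]]
pred_a = [[0.2, 0.4], [0.0, 1.0]]
fused = ensemble_average(pred_s, pred_a)
```

Averaging keeps the fused map in the same value range as its inputs, so any downstream evaluation applied to a single model's output can be applied to the ensemble unchanged.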
Taxonomy
Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment · Advanced Image Fusion Techniques
Methods: Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Channel Shuffle · 3 Dimensional Convolutional Neural Network
