Minimalistic Video Saliency Prediction via Efficient Decoder & Spatio Temporal Action Cues

Rohit Girmaji; Siddharth Jain; Bhav Beri; Sarthak Bansal; Vineet Gandhi

arXiv:2502.00397·cs.CV·February 4, 2025

TL;DR

This paper presents two ViNet variants for efficient video saliency prediction: the lightweight ViNet-S, and ViNet-A, which incorporates spatio-temporal action cues. Together they achieve state-of-the-art results at high speed and with low resource usage.

Contribution

Introduces ViNet-S and ViNet-A models that improve efficiency and incorporate spatio-temporal cues for superior video saliency prediction.

Findings

01

ViNet-S runs at over 1000 fps.

02

Ensemble of ViNet-S and ViNet-A outperforms existing models.

03

The models outperform transformer-based approaches in both parameter efficiency and real-time performance.

Abstract

This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, by averaging predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000fps.
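The ensembling step mentioned in the abstract, averaging the saliency maps predicted by the two models, can be illustrated with a minimal sketch. This is not the authors' code: the function name `ensemble_average` and the representation of a saliency map as a nested list of floats are assumptions made for illustration, and the sketch assumes both maps have already been brought to the same resolution and value range.

```python
from typing import List

# Hypothetical representation: an H x W grid of saliency values.
SaliencyMap = List[List[float]]


def ensemble_average(map_s: SaliencyMap, map_a: SaliencyMap) -> SaliencyMap:
    """Return the pixel-wise mean of two same-sized saliency maps.

    map_s and map_a stand in for the per-frame predictions of two
    models (for example ViNet-S and ViNet-A); how those predictions
    are produced is outside the scope of this sketch.
    """
    same_rows = len(map_s) == len(map_a)
    same_cols = all(len(r) == len(q) for r, q in zip(map_s, map_a))
    if not (same_rows and same_cols):
        raise ValueError("saliency maps must have the same shape")
    # Average corresponding pixels row by row.
    return [
        [(x + y) / 2.0 for x, y in zip(row_s, row_a)]
        for row_s, row_a in zip(map_s, map_a)
    ]


if __name__ == "__main__":
    pred_s = [[0.0, 1.0], [0.5, 0.5]]
    pred_a = [[1.0, 1.0], [0.0, 0.5]]
    print(ensemble_average(pred_s, pred_a))
```

Plain lists keep the sketch dependency-free; with an array library the same operation reduces to an element-wise mean of two arrays.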

Taxonomy

Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment · Advanced Image Fusion Techniques

Methods: Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Channel Shuffle · 3 Dimensional Convolutional Neural Network