ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi

TL;DR
ViNet is a real-time (60 fps), fully convolutional model for saliency prediction that uses no audio input, yet outperforms state-of-the-art audio-visual models on nine datasets and surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset.
Contribution
This paper introduces ViNet, a conceptually simple, causal encoder-decoder architecture for saliency prediction that outperforms existing audio-visual models using visual input alone, challenging assumptions about audio's role in such tasks.
Findings
ViNet outperforms state-of-the-art audio-visual saliency models on nine datasets (three visual-only and six audio-visual).
ViNet surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset.
Augmenting the decoder with audio features does not affect ViNet's output after training.
Abstract
We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our…
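The abstract reports results on the CC and SIM metrics. As a rough illustration of what these two scores measure, the sketch below computes them in plain Python using their commonly used definitions: CC as the Pearson correlation between the predicted and ground-truth maps, and SIM as the histogram intersection of the two maps after each is normalised to sum to 1. The function names `cc` and `sim` and the helper `_flatten` are illustrative choices, not taken from the paper, and the sketch assumes non-negative maps of equal size given as nested lists. It is not the authors' evaluation code.

```python
import math


def _flatten(saliency_map):
    # Turn a 2D map (list of rows) into a flat list of floats.
    return [float(v) for row in saliency_map for v in row]


def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps."""
    p, g = _flatten(pred), _flatten(gt)
    mean_p, mean_g = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_g = sum((b - mean_g) ** 2 for b in g)
    denom = math.sqrt(var_p * var_g)
    # A constant map has zero variance; report 0.0 instead of dividing by zero.
    return cov / denom if denom > 0 else 0.0


def sim(pred, gt):
    """Histogram intersection of two maps, each normalised to sum to 1."""
    p, g = _flatten(pred), _flatten(gt)
    sum_p, sum_g = sum(p), sum(g)
    # Assumption: maps are non-negative; an all-zero map scores 0.0.
    if sum_p <= 0 or sum_g <= 0:
        return 0.0
    return sum(min(a / sum_p, b / sum_g) for a, b in zip(p, g))
```

Both scores reach 1 when the prediction matches the ground truth exactly. CC ranges over [-1, 1], while SIM ranges over [0, 1] and drops to 0 when the two maps place their mass on disjoint pixels.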
Taxonomy
Topics: Multisensory perception and integration · Subtitles and Audiovisual Media · Visual Attention and Saliency Detection
