ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Samyak Jain; Pradeep Yarlagadda; Shreyank Jyoti; Shyamgopal Karthik; Ramanathan Subramanian; Vineet Gandhi

arXiv:2012.06170 · cs.CV · August 10, 2021

TL;DR

ViNet is a real-time (60 fps), fully convolutional model for saliency prediction that takes no audio input, yet outperforms state-of-the-art audio-visual models on nine datasets and surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset.

Contribution

This paper introduces ViNet, a conceptually simple, causal encoder-decoder architecture for saliency prediction. It outperforms existing audio-visual models without using audio, which challenges assumptions about audio's role in the task.

Findings

01

ViNet outperforms state-of-the-art audio-visual saliency models on nine datasets (three visual-only and six audio-visual).

02

ViNet surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset.

03

Adding audio features to the decoder does not affect ViNet's output after training.

Abstract

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our…
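The abstract reports results on the CC and SIM metrics without defining them. As a rough illustration only, the sketch below uses the definitions common in the saliency literature: CC as the Pearson correlation between a predicted and a ground-truth saliency map, and SIM as the histogram intersection of the two maps after each is normalized to sum to 1. The function names `cc`, `sim` and `flatten` are hypothetical, the maps are assumed to be non-negative 2D lists, and this is not taken from the authors' evaluation code.

```python
import math


def flatten(saliency_map):
    """Flatten a 2D list-of-lists saliency map into a 1D list of floats."""
    return [float(v) for row in saliency_map for v in row]


def cc(pred, gt):
    """Pearson correlation coefficient between two saliency maps.

    Returns 0.0 if either map is constant (assumed convention for this sketch).
    """
    p, g = flatten(pred), flatten(gt)
    n = len(p)
    mean_p, mean_g = sum(p) / n, sum(g) / n
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_g = math.sqrt(sum((b - mean_g) ** 2 for b in g))
    if norm_p == 0 or norm_g == 0:
        return 0.0
    return cov / (norm_p * norm_g)


def sim(pred, gt):
    """Similarity (histogram intersection) between two non-negative maps.

    Each map is normalized to sum to 1, then the element-wise minimum is summed.
    Returns 0.0 if either map sums to zero (assumed convention for this sketch).
    """
    p, g = flatten(pred), flatten(gt)
    sum_p, sum_g = sum(p), sum(g)
    if sum_p == 0 or sum_g == 0:
        return 0.0
    return sum(min(a / sum_p, b / sum_g) for a, b in zip(p, g))


if __name__ == "__main__":
    prediction = [[0.0, 1.0], [2.0, 3.0]]
    ground_truth = [[0.0, 1.0], [2.0, 3.0]]
    print("CC:", cc(prediction, ground_truth))
    print("SIM:", sim(prediction, ground_truth))
```

Under these definitions, identical maps score close to 1 on both metrics, and maps with no overlapping mass score 0 on SIM. Published benchmarks may apply additional preprocessing (such as min-max scaling or resizing) before computing the scores.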

Code & Models

Repositories

samyak0210/ViNet
PyTorch · Official

Taxonomy

Topics: Multisensory perception and integration · Subtitles and Audiovisual Media · Visual Attention and Saliency Detection