Relevance-guided Audio Visual Fusion for Video Saliency Prediction

Li Yu; Xuanzhe Sun; Pan Gao; Moncef Gabbouj

arXiv:2411.11454 · cs.CV · November 19, 2024

TL;DR

This paper introduces AVRSP, a novel audio-visual saliency prediction network that dynamically fuses audio and visual features based on their semantic relevance, improving the prediction of human visual attention in videos.

Contribution

The paper proposes a relevance-guided fusion module and multi-scale feature integration techniques to enhance audio-visual saliency prediction accuracy.

Findings

01. Achieves competitive performance on six datasets.
02. Effectively handles audio-visual inconsistency issues.
03. Improves multi-scale visual feature utilization.

Abstract

Audio data, often synchronized with video frames, plays a crucial role in guiding the audience's visual attention. Incorporating audio information into video saliency prediction can therefore improve the prediction of human visual behavior. However, existing audio-visual saliency prediction methods often fuse audio and visual features directly, which ignores possible inconsistency between the two modalities, such as when the audio serves as background music. To address this issue, we propose a novel relevance-guided audio-visual saliency prediction network dubbed AVRSP. Specifically, the Relevance-guided Audio-Visual feature Fusion module (RAVF) dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements, thereby refining the integration process with visual features. Furthermore, the Multi-scale feature Synergy (MS) module…
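
The idea of retaining audio features in proportion to their relevance can be illustrated with a small sketch. This is not the RAVF module from the paper, whose internals are not given in the abstract. It assumes that relevance is approximated by the cosine similarity of an audio and a visual embedding, and that a logistic gate scales the audio features before they are added to the visual ones. The names `relevance_gate` and `fuse`, and the parameters `threshold` and `sharpness`, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance_gate(audio, visual, threshold=0.5, sharpness=5.0):
    """Hypothetical gate in (0, 1): close to 1 when audio and visual agree.

    Assumption: relevance is the cosine similarity of the two embeddings,
    squashed by a logistic function centred at `threshold`.
    """
    sim = cosine_similarity(audio, visual)
    return 1.0 / (1.0 + math.exp(-sharpness * (sim - threshold)))


def fuse(audio, visual, threshold=0.5, sharpness=5.0):
    """Add gated audio features to the visual features, element by element."""
    gate = relevance_gate(audio, visual, threshold, sharpness)
    return [v + gate * a for a, v in zip(audio, visual)]


if __name__ == "__main__":
    visual = [1.0, 0.0]
    on_screen_speech = [2.0, 0.0]   # points the same way as the visual embedding
    background_music = [0.0, 1.0]   # unrelated to the visual embedding
    print(relevance_gate(on_screen_speech, visual))
    print(relevance_gate(background_music, visual))
    print(fuse(background_music, visual))
```

Under these assumptions, audio that is unrelated to the frame content (the background-music case) receives a small gate value and contributes little to the fused features, while audio that matches the visual content is mostly retained.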

Taxonomy

Topics: Color perception and design · Multisensory perception and integration