Relevance-guided Audio Visual Fusion for Video Saliency Prediction
Li Yu, Xuanzhe Sun, Pan Gao, Moncef Gabbouj

TL;DR
This paper introduces AVRSP, a novel audio-visual saliency prediction network that dynamically fuses audio and visual features based on their semantic relevance, improving the prediction of human visual attention in videos.
Contribution
The paper proposes a relevance-guided fusion module and multi-scale feature integration techniques to enhance audio-visual saliency prediction accuracy.
Findings
Achieves competitive performance on six datasets.
Effectively handles audio-visual inconsistency issues.
Improves multi-scale visual feature utilization.
Abstract
Audio data, often synchronized with video frames, plays a crucial role in guiding the audience's visual attention. Incorporating audio information into video saliency prediction can therefore improve the prediction of human visual behavior. However, existing audio-visual saliency prediction methods often fuse audio and visual features directly, which ignores possible inconsistency between the two modalities, such as when the audio serves as background music. To address this issue, we propose a novel relevance-guided audio-visual saliency prediction network dubbed AVRSP. Specifically, the Relevance-guided Audio-Visual feature Fusion module (RAVF) dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements, thereby refining their integration with visual features. Furthermore, the Multi-scale feature Synergy (MS) module…
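The idea of retaining audio features in proportion to their relevance can be illustrated with a minimal sketch. This is not the paper's RAVF module, whose internals are not given in the abstract. It assumes, purely for illustration, that the audio and visual features are already projected into a shared embedding space, that relevance is measured by cosine similarity, and that negative similarity is clamped to zero. The function names `cosine_similarity` and `relevance_gated_fusion` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance_gated_fusion(visual, audio):
    """Illustrative fusion (an assumption, not the published RAVF design).

    The gate is the cosine similarity clamped to [0, 1]. Audio that is
    unrelated to the visual content (gate near 0) is mostly discarded,
    while related audio (gate near 1) is added to the visual features.
    """
    gate = max(0.0, cosine_similarity(visual, audio))
    fused = [v + gate * a for v, a in zip(visual, audio)]
    return fused, gate


if __name__ == "__main__":
    visual = [1.0, 0.0]
    # Audio aligned with the visual content is fully retained.
    print(relevance_gated_fusion(visual, [1.0, 0.0]))
    # Orthogonal audio (e.g. background music) is gated out.
    print(relevance_gated_fusion(visual, [0.0, 1.0]))
```

With a gate of this kind, unrelated audio such as background music leaves the visual features unchanged, which is the behavior the abstract motivates. A learned gate would replace the fixed cosine rule in any trained model.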
Taxonomy
Topics: Color perception and design · Multisensory perception and integration
