Weakly Supervised Visual-Auditory Fixation Prediction with Multigranularity Perception
Guotao Wang, Chenglizhao Chen, Deng-Ping Fan, Aimin Hao, and Hong Qin

TL;DR
This paper introduces a weakly supervised method for visual-audio fixation prediction that leverages class activation mapping and multi-granularity perception, reducing the need for large-scale fixation datasets while achieving competitive performance.
Contribution
It proposes a novel weakly supervised approach using only video category tags and class activation mapping to predict visual-audio fixations, with an upgraded multi-granularity perception mechanism.
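As a rough illustration of the class activation mapping idea named above, the sketch below computes a generic CAM: a weighted sum of the last convolutional feature maps, using the classifier weights of one category tag, rescaled to [0, 1]. It is not the paper's SCAM or SCAM+ and omits any selection or multi-granularity step. The function name `class_activation_map`, the toy feature maps, and the weights are all hypothetical, chosen only for illustration.

```python
from typing import List

Map2D = List[List[float]]


def class_activation_map(features: List[Map2D], class_weights: List[float]) -> Map2D:
    """Generic CAM: weighted sum of feature maps, min-max normalised to [0, 1]."""
    if len(features) != len(class_weights):
        raise ValueError("need one weight per feature channel")
    height, width = len(features[0]), len(features[0][0])

    # Accumulate w_k * f_k(y, x) over all channels k.
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += w * fmap[y][x]

    # Rescale so the strongest response is 1 and the weakest is 0.
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = hi - lo
    if span == 0.0:
        return [[0.0] * width for _ in range(height)]
    return [[(v - lo) / span for v in row] for row in cam]


# Toy example: two 2x2 feature channels and hypothetical weights for one tag.
toy_features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
toy_weights = [0.5, 1.0]
toy_cam = class_activation_map(toy_features, toy_weights)
```

In a weakly supervised setting, a map like `toy_cam` can stand in for a fixation map because the classifier was trained only on category tags, yet its weights indicate which spatial locations drove the prediction.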
Findings
The method achieves performance comparable to fully supervised models.
It reduces the reliance on large-scale fixation datasets.
The approach is applicable even without video tags, broadening its usability.
Abstract
Thanks to the rapid advances in deep learning techniques and the wide availability of large-scale training sets, the performance of video saliency detection models has been improving steadily and significantly. However, deep learning-based visual-audio fixation prediction is still in its infancy. At present, only a few visual-audio sequences have been furnished, with real fixations being recorded in real visual-audio environments. Hence, it would be neither efficient nor necessary to recollect real fixations under the same visual-audio circumstances. To address this problem, this paper promotes a novel approach in a weakly supervised manner to alleviate the demand for large-scale training sets for visual-audio model training. By using only the video category tags, we propose the selective class activation mapping (SCAM) and its upgrade (SCAM+). In the spatial-temporal-audio circumstance,…
Taxonomy
Topics: Visual Attention and Saliency Detection · Image Enhancement Techniques · Image and Video Quality Assessment
