Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
Haytham M. Fayek, Anurag Kumar

TL;DR
This paper introduces a multi-modal audiovisual fusion model with an attention mechanism for sound recognition in weakly labeled videos, improving mean Average Precision (mAP) on AudioSet over single-modal models and previous fusion models.
Contribution
The paper presents an audiovisual fusion model that uses attention to dynamically combine the outputs of separate audio and visual models for sound recognition, outperforming single-modal models and prior state-of-the-art fusion and multi-modal models on AudioSet.
Findings
Achieved a mean Average Precision (mAP) of 46.16 on AudioSet.
Outperformed the prior state of the art by approximately 4.35 mAP points.
Demonstrated the effectiveness of multi-modal fusion in sound recognition.
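The findings above are reported in mAP. As a minimal sketch of how that metric is commonly computed for multi-label data, the NumPy code below averages per-class average precision over classes. The function names, the toy labels and scores, and the choice to skip classes with no positive examples are illustrative assumptions; the paper's exact evaluation protocol is not given in this summary.

```python
import numpy as np

def average_precision(labels, scores):
    """Average precision for one class: mean of precision@k over the ranks k of the positives."""
    order = np.argsort(-scores, kind="stable")      # rank clips by descending score
    ranked = labels[order]
    hits = np.cumsum(ranked)                        # positives seen up to each rank
    precision_at_k = hits / np.arange(1, len(ranked) + 1)
    return float((precision_at_k * ranked).sum() / ranked.sum())

def mean_average_precision(labels, scores):
    """Mean of per-class AP over classes that have at least one positive (an assumption)."""
    aps = [
        average_precision(labels[:, c], scores[:, c])
        for c in range(labels.shape[1])
        if labels[:, c].sum() > 0
    ]
    return float(np.mean(aps))

# Toy example: 4 clips, 2 sound classes (hypothetical values).
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
scores = np.array([[0.9, 0.2], [0.8, 0.7], [0.7, 0.6], [0.1, 0.3]])
toy_map = mean_average_precision(labels, scores)
```

Multiplying the result by 100 gives a value on the same scale as the figures quoted above, assuming they are expressed as percentages.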
Abstract
Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task in that it is easier to differentiate sounds using both the audio and visual modalities as opposed to one or the other. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model utilizes an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large-scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms the single-modal models, and state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately +4.35 mAP…
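One way to picture "dynamically combine the outputs of the individual audio and visual models" is a per-class weighted sum whose weights come from a softmax over the two modalities. The sketch below illustrates that general idea only. It is not the authors' architecture: the function names, tensor shapes, and the random stand-in attention scores (which a learned attention network would normally produce) are all assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_fusion(audio_out, visual_out, attn_scores):
    """Fuse per-class outputs of two modalities with attention weights.

    audio_out, visual_out: (batch, classes) outputs of the single-modal models.
    attn_scores: (batch, 2, classes) unnormalized scores, one per modality and class
                 (hypothetical; assumed to come from a learned attention function).
    """
    weights = softmax(attn_scores, axis=1)            # weights sum to 1 across modalities
    stacked = np.stack([audio_out, visual_out], axis=1)
    return (weights * stacked).sum(axis=1)            # (batch, classes)

# Toy inputs: 4 clips, 5 sound classes (random stand-ins for model outputs).
rng = np.random.default_rng(0)
audio_out = rng.random((4, 5))
visual_out = rng.random((4, 5))
attn_scores = rng.normal(size=(4, 2, 5))
fused = attention_fusion(audio_out, visual_out, attn_scores)
```

Because the weights are a convex combination, each fused value lies between the audio and visual outputs for that clip and class, and equal scores reduce the fusion to a plain average.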
