Multi-level Attention Model for Weakly Supervised Audio Classification
Changsong Yu, Karim Said Barsim, Qiuqiang Kong, Bin Yang

TL;DR
This paper introduces a multi-level attention model for weakly supervised audio classification, leveraging multiple attention modules at different neural network layers to improve prediction accuracy on large-scale datasets.
Contribution
The paper extends the single-level attention model by applying attention modules to several intermediate neural network layers and concatenating their outputs before a multi-label classifier, with the aim of improving weakly supervised audio classification.
Findings
Achieved a mean average precision (mAP) of 0.360 on Audio Set.
Outperformed the previous single-level attention model (mAP 0.327).
Surpassed the Google baseline (mAP 0.314).
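The figures above are mean average precision values. As a reminder of what that metric measures, here is a minimal sketch in plain Python: average precision is computed per class from the ranked clip scores, then averaged over classes. The function names and the toy data are illustrative assumptions, and the paper's exact evaluation protocol (for example, how ties are handled) is not described in this summary.

```python
import math


def average_precision(y_true, y_score):
    """Average precision for one class: mean of precision@rank over the positive clips."""
    # Rank clips by descending score (ties keep their original order; an assumption).
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            precisions.append(hits / rank)
    # Undefined when the class has no positive clip.
    return sum(precisions) / hits if hits else float("nan")


def mean_average_precision(Y_true, Y_score):
    """Mean of per-class AP; rows are clips, columns are classes."""
    n_classes = len(Y_true[0])
    aps = []
    for k in range(n_classes):
        ap = average_precision([row[k] for row in Y_true],
                               [row[k] for row in Y_score])
        if not math.isnan(ap):  # skip classes without positives
            aps.append(ap)
    return sum(aps) / len(aps) if aps else float("nan")


# Toy data: 4 clips, 2 classes (made up for illustration only).
Y_true = [[1, 0], [0, 1], [1, 0], [0, 0]]
Y_score = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.1, 0.3]]
toy_map = mean_average_precision(Y_true, Y_score)
```

In the toy data, class 0 has its positives at ranks 1 and 3, giving an AP of (1/1 + 2/3) / 2, and class 1 has its only positive at rank 1, giving an AP of 1.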
Abstract
In this paper, we propose a multi-level attention model to solve the weakly labelled audio classification problem. The objective of audio classification is to predict the presence or absence of audio events in an audio clip. Recently, Google published a large-scale weakly labelled dataset called Audio Set, where each audio clip is annotated only with the presence or absence of audio events, without their onset and offset times. Our multi-level attention model is an extension of the previously proposed single-level attention model. It consists of several attention modules applied to intermediate neural network layers. The outputs of these attention modules are concatenated into a vector, which is followed by a multi-label classifier that makes the final prediction for each class. Experiments show that our model achieves a mean average precision (mAP) of 0.360, outperforms the…
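The architecture described in the abstract can be sketched in plain Python. This is a toy illustration, not the authors' implementation: the layer sizes, the random weights, the ReLU hidden layers, and the specific form of the attention module (a per-class sigmoid classifier weighted by normalized positive attention scores over time) are all assumptions chosen for readability. What is taken from the abstract is the overall shape: attention modules on several intermediate layers, their outputs concatenated, then a multi-label classifier.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(x, W, b):
    """Affine map of one feature vector x (W has one row per output unit)."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_linear(rng, d_in, d_out):
    """Hypothetical random initialisation, for illustration only."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_out)]
    return W, [0.0] * d_out


def attention_pool(frames, cla, att):
    """Collapse T frame vectors into one K-dim vector (assumed attention form)."""
    cla_out = [[sigmoid(z) for z in linear(f, *cla)] for f in frames]   # T x K
    att_out = [[math.exp(z) for z in linear(f, *att)] for f in frames]  # T x K, positive
    pooled = []
    for k in range(len(cla_out[0])):
        norm = sum(a[k] for a in att_out)  # normalise attention over time
        pooled.append(sum(a[k] / norm * c[k] for a, c in zip(att_out, cla_out)))
    return pooled


def multi_level_attention(frames, hidden_layers, heads, classifier):
    """Run frames through the layers, pool at selected levels, concatenate, classify."""
    h = frames
    concatenated = []
    for (W, b), head in zip(hidden_layers, heads):
        h = [[max(0.0, z) for z in linear(f, W, b)] for f in h]  # ReLU layer
        if head is not None:  # this level has an attention module attached
            concatenated.extend(attention_pool(h, *head))
    # Multi-label classifier: independent sigmoid per class.
    return [sigmoid(z) for z in linear(concatenated, *classifier)]


rng = random.Random(0)
T, d_in, d_hid, K = 10, 8, 16, 5  # toy sizes, not from the paper
frames = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]
hidden_layers = [init_linear(rng, d_in, d_hid),
                 init_linear(rng, d_hid, d_hid),
                 init_linear(rng, d_hid, d_hid)]
# Attention modules on the 2nd and 3rd hidden layers (an arbitrary choice).
heads = [None,
         (init_linear(rng, d_hid, K), init_linear(rng, d_hid, K)),
         (init_linear(rng, d_hid, K), init_linear(rng, d_hid, K))]
classifier = init_linear(rng, 2 * K, K)
probs = multi_level_attention(frames, hidden_layers, heads, classifier)
```

Here `probs` holds one probability per class for the whole clip. Because each attention module pools over the time axis, only clip-level labels are needed, which matches the weakly labelled setting the abstract describes.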
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection
