Joint Analysis of Acoustic Scenes and Sound Events with Weakly labeled Data
Shunsuke Tsubaki, Keisuke Imoto, Nobutaka Ono

TL;DR
This paper introduces a multi-task learning approach for joint acoustic scene and sound event analysis using weak labels, reducing annotation effort while improving performance over traditional methods.
Contribution
It proposes a novel MTL framework with weak labels and evaluates multiple pooling functions, demonstrating superior results in scene and event detection tasks.
Findings
Weakly supervised MTL outperforms single-task models.
Four pooling functions are evaluated for weakly supervised sound event detection, including max pooling, average pooling, and exponential softmax pooling.
The method improves both scene classification and event detection accuracy.
Abstract
Considering that acoustic scenes and sound events are closely related to each other, in some previous papers, a joint analysis of acoustic scenes and sound events utilizing multitask learning (MTL)-based neural networks was proposed. In conventional methods, a strongly supervised scheme is applied to sound event detection in MTL models, which requires strong labels of sound events in model training; however, annotating strong event labels is quite time-consuming. In this paper, we thus propose a method for the joint analysis of acoustic scenes and sound events based on the MTL framework with weak labels of sound events. In particular, in the proposed method, we introduce the multiple-instance learning scheme for weakly supervised training of sound event detection and evaluate four pooling functions, namely, max pooling, average pooling, exponential softmax pooling, and attention…
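To illustrate the multiple-instance learning step mentioned in the abstract, here is a minimal NumPy sketch of three of the pooling functions it names. Each function aggregates frame-level event probabilities into a clip-level prediction, which can then be compared against a weak (clip-level) label. The function names, array shapes, and toy values are assumptions made for illustration and are not taken from the paper. The fourth pooling function is left out because its name is truncated in the abstract above.

```python
import numpy as np

# Frame-level probabilities have shape (T, K): T time frames, K event classes.
# All names and values below are illustrative assumptions, not the paper's code.

def max_pooling(frame_probs, axis=0):
    """Clip-level probability is the largest frame-level probability."""
    return frame_probs.max(axis=axis)

def average_pooling(frame_probs, axis=0):
    """Clip-level probability is the mean of the frame-level probabilities."""
    return frame_probs.mean(axis=axis)

def exp_softmax_pooling(frame_probs, axis=0):
    """Weighted mean in which each frame is weighted by exp(probability)."""
    weights = np.exp(frame_probs)
    return (weights * frame_probs).sum(axis=axis) / weights.sum(axis=axis)

# Toy example: 3 frames, 2 event classes.
frame_probs = np.array([
    [0.1, 0.8],
    [0.9, 0.2],
    [0.2, 0.2],
])

clip_max = max_pooling(frame_probs)
clip_avg = average_pooling(frame_probs)
clip_exp = exp_softmax_pooling(frame_probs)
```

On this toy input, exponential softmax pooling lands between the average and the maximum for each class, because frames with higher probability receive larger weights.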
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
