Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels
Keisuke Imoto

TL;DR
This paper introduces a semi-supervised, multitask learning framework that leverages acoustic scene context and partial labels to improve sound event detection while reducing annotation costs.
Contribution
The paper proposes a method for joint analysis of acoustic scenes and sound events that trains on partial labels of sound events with semi-supervised learning and refines those labels via self-distillation.
Findings
Improved sound event detection accuracy with reduced annotation effort.
Effective use of acoustic scene context to construct partial labels.
Enhanced model performance through label refinement techniques.
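To make the second and third findings concrete, here is a minimal sketch of how a clip's scene label could be turned into a partial label (a set of candidate events) and then narrowed. The scene-to-event table, the class names, the threshold, and the `refine` rule are all assumptions for illustration. The summary above does not describe the paper's actual mapping or refinement procedure, and in self-distillation the "teacher" scores would come from an earlier copy of the same model.

```python
# Hypothetical scene-to-event table; scene and event names are illustrative only.
SCENE_TO_EVENTS = {
    "home": {"dishes", "speech", "door"},
    "street": {"car", "speech", "footsteps"},
}

# Fixed class order shared by all masks: car, dishes, door, footsteps, speech.
EVENT_CLASSES = sorted(set().union(*SCENE_TO_EVENTS.values()))


def partial_label(scene):
    """Return a 0/1 candidate mask over EVENT_CLASSES for a clip's scene."""
    candidates = SCENE_TO_EVENTS[scene]
    return [1 if event in candidates else 0 for event in EVENT_CLASSES]


def refine(mask, teacher_probs, threshold=0.5):
    """Assumed refinement rule: keep a candidate only if the teacher's
    clip-level probability for it reaches the threshold."""
    return [m if p >= threshold else 0 for m, p in zip(mask, teacher_probs)]


home_mask = partial_label("home")
refined_mask = refine(home_mask, [0.9, 0.8, 0.1, 0.2, 0.7])
```

Note that `refine` can only shrink the candidate set: an event the scene rules out stays excluded even when the teacher scores it highly.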
Abstract
Annotating time boundaries of sound events is labor-intensive, limiting the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach, where a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. Building on this idea, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. While…
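The multitask objective described in the abstract can be sketched as a scene classification loss plus a sound event loss that uses only what a partial label guarantees. This is one plausible formulation, not the paper's loss: it assumes events outside the candidate set are known negatives (target 0) and leaves candidate events unsupervised. The function names and the `weight` parameter are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the true scene class."""
    return -math.log(probs[target_index])


def masked_bce(event_probs, candidate_mask, eps=1e-7):
    """Binary cross-entropy with target 0 on events outside the candidate
    set; events inside the set are skipped because their presence is unknown."""
    terms = [
        -math.log(max(1.0 - p, eps))
        for p, m in zip(event_probs, candidate_mask)
        if m == 0
    ]
    return sum(terms) / len(terms) if terms else 0.0


def joint_loss(scene_probs, scene_index, event_probs, candidate_mask, weight=1.0):
    """Scene loss plus weighted partial-label event loss (assumed weighting)."""
    scene_term = cross_entropy(scene_probs, scene_index)
    event_term = masked_bce(event_probs, candidate_mask)
    return scene_term + weight * event_term


# Two scenes, two events; only the first event is outside the candidate set.
loss = joint_loss([0.5, 0.5], 0, [0.5, 0.9], [0, 1])
```

In this example each term equals ln 2, so the loss is 2 ln 2. The second event's high probability contributes nothing because it is a candidate.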
