Joint framework with deep feature distillation and adaptive focal loss for weakly supervised audio tagging and acoustic event detection
Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang, Yuping Wang

TL;DR
This paper introduces a joint training framework for weakly supervised audio tagging and acoustic event detection, utilizing deep feature distillation, adaptive focal loss, and event-specific post-processing to enhance performance.
Contribution
It proposes a novel combination of deep feature distillation, adaptive focal loss, and post-processing strategies within a teacher-student framework for improved weakly supervised audio analysis.
Findings
Achieved 81.2% F1-score in audio tagging
Achieved 49.8% F1-score in acoustic event detection
Demonstrated competitive performance on the DCASE 2019 dataset
Abstract
A good joint training framework helps improve weakly supervised audio tagging (AT) and acoustic event detection (AED) simultaneously. In this study, we propose three methods to improve the best teacher-student framework in the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 4 for both the audio tagging and acoustic event detection tasks. We first propose a frame-level, target-event-based deep feature distillation, which aims to leverage the limited strong-labeled data in a weakly supervised framework to learn better intermediate feature maps. Then, we propose an adaptive focal loss and a two-stage training strategy to enable more effective and accurate model training, in which the contributions of hard and easy acoustic events to the total cost function are adjusted automatically. Furthermore,…
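To make the "hard versus easy" weighting concrete, here is a minimal sketch of the standard (non-adaptive) focal loss for multi-label sigmoid outputs. It is not the paper's adaptive variant, whose adjustment rule is not given in this summary. The function name `binary_focal_loss`, the default `gamma=2.0`, and the clipping constant `eps` are illustrative assumptions.

```python
import math

def binary_focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean focal loss over per-class sigmoid probabilities.

    Hypothetical helper for illustration only: this is the standard
    focal loss, not the adaptive version proposed in the paper.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # Clip to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        # Probability the model assigns to the true label.
        p_t = p if y == 1 else 1.0 - p
        # (1 - p_t)^gamma shrinks the loss of well-classified (easy) examples.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)

# An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
easy = binary_focal_loss([0.9], [1])
hard = binary_focal_loss([0.1], [1])
```

With `gamma=0` the expression reduces to ordinary binary cross-entropy; larger values of `gamma` shift more of the total loss toward poorly classified events. An adaptive scheme would presumably vary this weighting during training rather than fix it in advance.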
Taxonomy
Methods: Focal Loss
