Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification
Yuzhong Wu, Tan Lee

TL;DR
This paper proposes a robust feature learning framework for acoustic scene classification that down-weights long-duration sounds during training, improving generalization across unseen devices and locations.
Contribution
It introduces a robust feature learning (RFL) framework that uses an auxiliary classifier and a loss function to make CNN-based acoustic scene classification (ASC) systems more robust to domain variations such as unseen recording devices and locations.
Findings
Improved accuracy on unseen devices and cities.
Enhanced robustness of ASC classifiers.
Effective down-weighting of long-duration sounds during training.
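The idea of down-weighting during training can be illustrated with a minimal sketch. This is not the paper's loss function, which this summary does not specify. It assumes, purely for illustration, that an auxiliary classifier sees only long-duration sound information and that each training example is scaled by one minus the auxiliary classifier's probability for the true class. The names `softmax` and `downweighted_loss` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def downweighted_loss(main_logits, aux_logits, labels):
    """Mean cross-entropy of the main classifier with per-example weights.

    ASSUMPTION (not from the paper): weight = 1 - p_aux, where p_aux is the
    auxiliary classifier's probability for the true class. Examples the
    auxiliary classifier already gets right contribute less to the loss.
    """
    total = 0.0
    for main, aux, y in zip(main_logits, aux_logits, labels):
        p_main = softmax(main)[y]   # main classifier, true-class probability
        p_aux = softmax(aux)[y]     # auxiliary classifier, true-class probability
        weight = 1.0 - p_aux        # hypothetical down-weighting rule
        total += weight * -math.log(p_main)
    return total / len(labels)
```

In a real training loop the weight would normally be treated as a constant, so that no gradient flows back into the auxiliary classifier through it.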
Abstract
Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal was recorded. Log-mel features and convolutional neural networks (CNNs) have recently become the most popular time-frequency (TF) feature representation and classifier in ASC. An audio signal recorded in a scene may include various sounds that overlap in time and frequency. A previous study suggests that considering long-duration and short-duration sounds separately in a CNN may improve ASC accuracy. This study addresses the generalization ability of acoustic scene classifiers. In practice, the characteristics of acoustic scene signals may be affected by various factors, such as the choice of recording device and changes in recording location. When an established ASC system predicts scene classes for audio recorded in unseen scenarios, its accuracy…
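To make the distinction between long-duration and short-duration sounds concrete, here is a minimal sketch of one common way to split a time-frequency representation by duration: a median filter along the time axis of each frequency band. Whether the previous study mentioned in the abstract uses exactly this procedure is not stated here, so treat the method, the function name `split_by_duration`, and the `kernel` parameter as assumptions.

```python
from statistics import median

def split_by_duration(log_mel, kernel=5):
    """Split a log-mel matrix into slowly varying and transient parts.

    log_mel: list of frequency bands, each a list of per-frame values.
    kernel:  odd window length in frames (hypothetical default).
    Returns (long_part, short_part), both with the same shape as log_mel.
    """
    half = kernel // 2
    long_part, short_part = [], []
    for band in log_mel:
        n = len(band)
        # A median over a time window keeps sustained energy and drops brief events.
        # Windows are truncated at the edges rather than padded.
        smooth = [median(band[max(0, t - half):min(n, t + half + 1)])
                  for t in range(n)]
        long_part.append(smooth)
        # The residual holds whatever the median filter removed.
        short_part.append([x - s for x, s in zip(band, smooth)])
    return long_part, short_part
```

For a band with a constant background and a single one-frame spike, the spike ends up in the short-duration part and the background in the long-duration part. Adding the two parts back together recovers the input.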
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
Methods: Auxiliary Classifier
