A Two-Stage Approach to Device-Robust Acoustic Scene Classification
Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian, Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun, Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee

TL;DR
This paper introduces a novel two-stage CNN-based system for acoustic scene classification that significantly improves device robustness and achieves state-of-the-art accuracy on DCASE 2020 data.
Contribution
A new two-stage CNN approach with score fusion, novel data augmentation, and analysis techniques enhances device-robust acoustic scene classification performance.
Findings
Achieves 81.9% accuracy on multi-device test data.
Significant improvement on unseen devices.
Provides insights via neural saliency analysis.
Abstract
To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages on an ad-hoc score combination based on two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set, where our best system, a two-stage fusion of CNN ensembles,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
