CNN depth analysis with different channel inputs for Acoustic Scene Classification
Sergi Perez-Castanos, Javier Naranjo-Alcazar, Pedro Zuccarello, Maximo Cobos, Francesc J. Ferri

TL;DR
This paper investigates the impact of different audio representations, ensemble aggregation techniques, and network depths on acoustic scene classification performance, emphasizing real-time edge device applications.
Contribution
It provides a comprehensive analysis of audio representations, ensemble methods, and network depths tailored for real-time acoustic scene classification on edge devices.
Findings
Harmonic, percussive, and stereo difference features yield the best results.
The geometric mean, the arithmetic mean, and ordered weighted averaging (OWA) are effective ensemble aggregation methods.
Shallow networks and efficient ensemble strategies suit real-time edge applications.
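The three aggregation rules named in the findings can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the example probabilities, and the OWA weights are all hypothetical, and renormalizing the geometric mean so that it sums to 1 is an assumption made here for readability.

```python
import math


def arithmetic_mean(probs):
    """Average each class probability across models.

    probs: one list of class probabilities per model (hypothetical input layout).
    """
    n_models = len(probs)
    return [sum(col) / n_models for col in zip(*probs)]


def geometric_mean(probs, eps=1e-12):
    """Geometric mean per class, renormalized to sum to 1 (an assumption)."""
    n_models = len(probs)
    raw = [
        math.exp(sum(math.log(max(p, eps)) for p in col) / n_models)
        for col in zip(*probs)
    ]
    total = sum(raw)
    return [r / total for r in raw]


def owa(probs, weights):
    """Ordered weighted averaging.

    For each class, the model scores are sorted in descending order and the
    weights are applied by rank, not by model identity.
    """
    if len(weights) != len(probs):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [
        sum(w * s for w, s in zip(weights, sorted(col, reverse=True)))
        for col in zip(*probs)
    ]


def predict(scores):
    """Index of the highest aggregated score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up outputs of three models over three scene classes.
probs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
]
mean_scores = arithmetic_mean(probs)
geo_scores = geometric_mean(probs)
owa_scores = owa(probs, [0.5, 0.3, 0.2])  # illustrative weights only
```

With weights [1, 0, 0] OWA reduces to the per-class maximum, with [0, 0, 1] to the per-class minimum, and with equal weights to the arithmetic mean, which is why it is often treated as a generalization of the simpler rules.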
Abstract
Acoustic scene classification (ASC) has been approached in recent years using deep learning techniques such as convolutional neural networks or recurrent neural networks. Many state-of-the-art solutions are based on image classification frameworks and therefore train on a 2D representation of the audio signal. Finding the most suitable audio representation is still an active research question. In this paper, different log-Mel representations and their combinations are analyzed. Experiments show that the best results are obtained using the harmonic and percussive components plus the difference between the left and right stereo channels (L-R). In addition, ensembling different models is a common strategy for increasing the final accuracy. Even though averaging different model predictions is a common choice, an exhaustive analysis of different…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection
