TL;DR
This paper introduces a novel time-frequency transformation layer based on complex frequency B-spline wavelets, improving accuracy and robustness in environmental sound classification over traditional methods like STFT.
Contribution
The paper presents a new fbsp-layer for time-frequency transformation that enhances classification accuracy and robustness in audio models, with analysis of pre-training strategies and noise resilience.
Findings
Achieved 95.20% accuracy on the ESC-50 dataset.
Achieved 89.14% accuracy on the UrbanSound8K dataset.
Demonstrated increased robustness against noise and signal reduction.
Abstract
Environmental Sound Classification (ESC) is a rapidly evolving field that has recently demonstrated the advantages of applying visual-domain techniques to audio-related tasks. Previous studies indicate that domain-specific modification of cross-domain approaches shows promise in pushing the whole area of ESC forward. In this paper, we present a new time-frequency transformation layer based on complex frequency B-spline (fbsp) wavelets. Used with a high-performance audio classification model, the proposed fbsp-layer provides an accuracy improvement over the previously used Short-Time Fourier Transform (STFT) on standard datasets. We also investigate the influence of different pre-training strategies, including the joint use of two large-scale datasets for weight initialization: ImageNet and AudioSet. Our proposed model outperforms other approaches by achieving…
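To make the idea of an fbsp-based time-frequency transform more concrete, here is a minimal NumPy sketch. It is not the paper's fbsp-layer (which the abstract describes as part of a classification model). It only evaluates the standard textbook form of the complex frequency B-spline wavelet, psi(t) = sqrt(fb) * sinc(fb * t / m)^m * exp(2*pi*i*fc*t), and convolves a signal with one such kernel per center frequency. The function names, the `bandwidth_ratio` and `half_width_s` parameters, and the choice of tying bandwidth to center frequency are all assumptions made for illustration.

```python
import numpy as np

def fbsp_wavelet(t, m=2, fb=1.0, fc=1.0):
    """Complex frequency B-spline wavelet evaluated at times t.

    psi(t) = sqrt(fb) * sinc(fb * t / m)**m * exp(2j * pi * fc * t)
    np.sinc is the normalized sinc, sin(pi x) / (pi x).
    """
    t = np.asarray(t, dtype=float)
    return np.sqrt(fb) * np.sinc(fb * t / m) ** m * np.exp(2j * np.pi * fc * t)

def fbsp_transform(signal, sample_rate, center_freqs, m=2,
                   bandwidth_ratio=0.25, half_width_s=0.05):
    """Magnitude time-frequency map, one row per center frequency.

    Hypothetical helper: bandwidth is set to bandwidth_ratio * fc, and each
    kernel is truncated to +/- half_width_s seconds. Both are illustrative
    choices, not values taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    k = int(round(half_width_s * sample_rate))
    t = np.arange(-k, k + 1) / sample_rate  # kernel time axis in seconds
    rows = []
    for fc in center_freqs:
        kernel = fbsp_wavelet(t, m=m, fb=bandwidth_ratio * fc, fc=fc)
        # Dividing by sample_rate approximates the continuous-time integral.
        response = np.convolve(signal, kernel, mode="same") / sample_rate
        rows.append(np.abs(response))
    return np.stack(rows)

# Usage: a 440 Hz tone analysed at three center frequencies.
sample_rate = 8000
time_axis = np.arange(sample_rate) / sample_rate  # one second of audio
tone = np.cos(2 * np.pi * 440.0 * time_axis)
tf_map = fbsp_transform(tone, sample_rate, center_freqs=[220.0, 440.0, 880.0])
strongest_band = int(np.argmax(tf_map.mean(axis=1)))
```

In this sketch the 440 Hz row should carry the most energy, because the other two kernels have passbands that do not cover 440 Hz. In a trainable layer, parameters such as `m`, `fb`, and `fc` would presumably be learned rather than fixed, but the abstract does not give those details.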
