Contrastive Environmental Sound Representation Learning
Peter Ochieng, Dennis Kaburu

TL;DR
This paper introduces a self-supervised contrastive learning approach using a shallow 1D CNN to extract robust environmental sound representations from raw audio and spectrograms, improving recognition accuracy.
Contribution
It proposes a contrastive learning method that learns environmental sound representations without annotations and fuses the representations from multiple input types via CCA.
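To make the contrastive objective concrete, here is a minimal sketch of one common choice, the NT-Xent loss used in SimCLR-style training. This summary does not state which contrastive loss the paper uses, so the loss form, the function name `nt_xent_loss`, and the `temperature` default are all assumptions made for illustration. Rows of `z_a` and `z_b` stand in for encoder outputs of two views of the same clips.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss (assumed loss form; not confirmed by the summary).

    Row i of z_a and row i of z_b are treated as a positive pair;
    every other row in the combined batch is a negative.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0).astype(np.float64)
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors -> cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                 # a sample is never its own negative
    # Index of the positive partner for each of the 2n rows: i <-> i + n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

The loss falls as paired embeddings move closer together relative to the rest of the batch, which is the property that lets an encoder be trained without labels.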
Findings
Achieved a 12.8% improvement on the ESC-50 dataset.
Achieved a 0.9% improvement on the UrbanSound8K dataset.
Demonstrated that fused features are more robust than the individual representations.
Abstract
Machine hearing of environmental sound is an important problem in the audio recognition domain. It gives a machine the ability to discriminate between different input sounds, which guides its decision making. In this work we exploit the self-supervised contrastive technique and a shallow 1D CNN to extract distinctive audio features (audio representations) without using any explicit annotations. We generate representations of a given audio clip from both its raw waveform and its spectrogram, and evaluate whether the proposed learner is agnostic to the type of audio input. We further use canonical correlation analysis (CCA) to fuse the representations from the two input types and demonstrate that the fused global feature is a more robust representation of the audio signal than either individual representation. The evaluation of the proposed technique is…
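The CCA fusion step described in the abstract can be sketched as follows. This is a minimal illustration using classical linear CCA in NumPy, not the authors' implementation: the function names `cca_fuse` and `_inv_sqrt`, the ridge term `reg`, and the choice to concatenate the two projected views (rather than sum them) are assumptions. Here `x` and `y` stand in for the waveform-based and spectrogram-based representations of the same clips.

```python
import numpy as np

def _inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca_fuse(x, y, k, reg=1e-6):
    """Project two views onto their top-k canonical directions and concatenate.

    x: (n, p) features from one input type; y: (n, q) from the other.
    Returns the fused (n, 2k) features and the top-k canonical correlations.
    """
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Regularised covariance estimates (reg is an assumed small ridge term).
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)
    sxx_is, syy_is = _inv_sqrt(sxx), _inv_sqrt(syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    u, s, vt = np.linalg.svd(sxx_is @ sxy @ syy_is)
    wx = sxx_is @ u[:, :k]
    wy = syy_is @ vt.T[:, :k]
    fused = np.concatenate([xc @ wx, yc @ wy], axis=1)  # fusion by concatenation (assumption)
    return fused, s[:k]
```

Concatenation keeps both projected views and doubles the feature width; summing the two projections is a common alternative that keeps the width at `k`.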
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
Methods: 1-Dimensional Convolutional Neural Networks
