Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation

Xilin Jiang; Cong Han; Yinghao Aaron Li; Nima Mesgarani

arXiv:2309.15938 · eess.AS · September 29, 2023 · 1 citation

Open Access

TL;DR

This paper introduces MC-SimCLR, a contrastive learning framework for spatial audio that improves event classification and localization by learning joint spectral and spatial representations through multi-level data augmentation.

Contribution

The paper proposes a multi-channel contrastive learning method, built around a multi-level augmentation pipeline, for learning spatial audio representations.

Findings

01

Linear layers trained on top of the learned representation significantly outperform supervised models in both event classification accuracy and localization error.

02

Augmentation methods significantly impact representation quality.

03

Fine-tuning with less labeled data remains effective.

Abstract

In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode 'what' and 'where' of spatial audios. MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audios, thereby enhancing both event classification and sound localization in downstream tasks. At its core, we propose a multi-level data augmentation pipeline that augments different levels of audio features, including waveforms, Mel spectrograms, and generalized cross-correlation (GCC) features. In addition, we introduce simple yet effective channel-wise augmentation methods to randomly swap the order of the microphones and mask Mel and GCC channels. By using these augmentations, we find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error. We…
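
The channel-wise augmentations described in the abstract (randomly swapping the microphone order and masking Mel and GCC channels) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `swap_microphones` and `mask_channels`, the array layouts, the `max_masked` parameter, and the choice to mask by zeroing are all assumptions made for illustration. How a swap should interact with location labels is not stated in the abstract and is left out here.

```python
import numpy as np


def swap_microphones(wave, rng):
    """Randomly permute the microphone axis of a (mics, samples) array.

    Hypothetical helper: returns the permuted waveform and the permutation used.
    """
    perm = rng.permutation(wave.shape[0])
    return wave[perm], perm


def mask_channels(features, rng, max_masked=1):
    """Zero out up to `max_masked` randomly chosen channels.

    `features` is assumed to have shape (channels, freq, time), e.g. stacked
    Mel or GCC feature maps. Zero-masking is an assumption of this sketch.
    """
    out = features.copy()
    n_masked = int(rng.integers(0, max_masked + 1))
    masked = rng.choice(features.shape[0], size=n_masked, replace=False)
    out[masked] = 0.0
    return out, masked


rng = np.random.default_rng(0)

# Assumed toy inputs: 4 microphones, 1 second at 16 kHz; 4 feature channels.
wave = rng.standard_normal((4, 16000))
feats = rng.standard_normal((4, 64, 100))

swapped, perm = swap_microphones(wave, rng)
masked_feats, masked = mask_channels(feats, rng, max_masked=2)
```

In a SimCLR-style setup, two independently augmented views of the same recording would typically be produced this way and passed to the encoder as a positive pair; the exact pairing used in MC-SimCLR is not specified in the text above.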

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Acoustic Wave Phenomena Research

Methods: Contrastive Learning