SPADE: Self-supervised Pretraining for Acoustic DisEntanglement

John Harvill; Jarred Barber; Arun Nair; Ramin Pishehvar

arXiv:2302.01483 · cs.LG · February 6, 2023 · 1 citation



TL;DR

This paper introduces SPADE, a self-supervised pretraining method that effectively disentangles room acoustics from speech signals, improving downstream device arbitration especially with limited labeled data.

Contribution

SPADE is a self-supervised approach for disentangling room acoustics from speech. The resulting acoustic representation is used on the downstream task of device arbitration.

Findings

01. Significantly outperforms the baseline when labeled training data is scarce

02. Learns to encode room acoustic information while remaining invariant to other attributes of the speech signal

03. Improves device arbitration performance

Abstract

Self-supervised representation learning approaches have grown in popularity due to the ability to train models on large amounts of unlabeled data and have demonstrated success in diverse fields such as natural language processing, computer vision, and speech. Previous self-supervised work in the speech domain has disentangled multiple attributes of speech such as linguistic content, speaker identity, and rhythm. In this work, we introduce a self-supervised approach to disentangle room acoustics from speech and use the acoustic representation on the downstream task of device arbitration. Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce, indicating that our pretraining scheme learns to encode room acoustic information while remaining invariant to other attributes of the speech signal.

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing