SPADE: Self-supervised Pretraining for Acoustic DisEntanglement
John Harvill, Jarred Barber, Arun Nair, Ramin Pishehvar

TL;DR
This paper introduces SPADE, a self-supervised pretraining method that disentangles room acoustics from speech signals and improves downstream device arbitration, especially when labeled data is limited.
Contribution
SPADE is the first self-supervised approach to disentangle room acoustics from speech, enhancing acoustic representation learning for speech processing tasks.
Findings
Significantly outperforms a baseline when labeled training data is scarce
Learns to encode room acoustic information while remaining invariant to other attributes of the speech signal
Improves device arbitration performance
Abstract
Self-supervised representation learning approaches have grown in popularity due to the ability to train models on large amounts of unlabeled data and have demonstrated success in diverse fields such as natural language processing, computer vision, and speech. Previous self-supervised work in the speech domain has disentangled multiple attributes of speech such as linguistic content, speaker identity, and rhythm. In this work, we introduce a self-supervised approach to disentangle room acoustics from speech and use the acoustic representation on the downstream task of device arbitration. Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce, indicating that our pretraining scheme learns to encode room acoustic information while remaining invariant to other attributes of the speech signal.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
