Towards Disentangled Speech Representations

Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

arXiv:2208.13191 · cs.SD · August 30, 2022

Open Access

TL;DR

This paper explores learning disentangled speech representations through joint modeling of ASR and TTS, and shows that enforcing the statistical properties of disentangled solutions during training improves transcription accuracy.

Contribution

The paper introduces an approach to learning disentangled speech representations that combines joint ASR-TTS modeling with statistical properties enforced during training, reducing word error rate (WER).

Findings

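01. Whether training finds a disentangled representation depends on the randomness inherent in training.

02. Enforcing the statistical properties of disentangled solutions during training improves transcription accuracy.

03. The approach achieves a 24.5% relative WER reduction.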
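
A note on the metric: a relative WER reduction is measured as a fraction of the baseline error rate, not in absolute percentage points. The sketch below illustrates that arithmetic in Python. The function names and the error counts are hypothetical, chosen only to make the calculation concrete; they are not results or code from the paper.

```python
def word_error_rate(substitutions, deletions, insertions, reference_words):
    """WER = (S + D + I) / N, where N is the number of reference words."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical error counts, for illustration only (not from the paper).
baseline = word_error_rate(60, 20, 20, 1000)  # 100 errors per 1000 words
improved = word_error_rate(45, 15, 15, 1000)  # 75 errors per 1000 words
reduction = relative_wer_reduction(baseline, improved)
print(f"baseline WER: {baseline:.3f}, improved WER: {improved:.3f}")
print(f"relative WER reduction: {reduction:.1%}")
```

In this made-up case the absolute drop is 2.5 percentage points, while the relative reduction is a quarter of the baseline error rate. The same distinction applies when reading the figure reported above.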

Abstract

The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint modeling of ASR and TTS, and seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not. We present empirical evidence that successfully finding such a representation is tied to the randomness inherent in training. We then make the observation that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Finally, we show that enforcing these properties…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis