Content-Context Factorized Representations for Automated Speech Recognition

David M. Chan; Shalini Ghosh

arXiv:2205.09872 · eess.AS · September 16, 2022 · 1 cites

TL;DR

This paper introduces an unsupervised method to separate speech representations into content and context factors, improving ASR performance especially in noisy environments by reducing spurious correlations.

Contribution

It proposes a novel unsupervised, encoder-agnostic approach for factorizing speech representations into content and context, enhancing robustness and generalization in ASR.

Findings

01. Improved ASR accuracy on standard benchmarks.

02. Enhanced robustness in noisy conditions.

03. Effective separation of content and context representations.

Abstract

Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may consist not only of information about the spoken language content, but also may contain information about unnecessary contexts such as background noise and sounds or speaker identity, accent, or protected attributes. Such information can directly harm generalization performance, by introducing spurious correlations between the spoken words and the context in which such words were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved…
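
The abstract does not spell out how the factorization is carried out, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes, hypothetically, that an encoder's feature vector is split into a leading "content" part and a trailing "context" part, and that a simple cross-covariance penalty is used to measure how much linear correlation remains between the two parts over a batch. The names `split_representation`, `cross_covariance_penalty`, and `content_dim` are made up for this example.

```python
from typing import List, Sequence, Tuple


def split_representation(
    features: Sequence[float], content_dim: int
) -> Tuple[List[float], List[float]]:
    """Split one encoder feature vector into (content, context) parts.

    Assumption for illustration only: the first `content_dim` entries are
    treated as content and the remaining entries as context.
    """
    if not 0 < content_dim < len(features):
        raise ValueError("content_dim must leave both parts non-empty")
    return list(features[:content_dim]), list(features[content_dim:])


def cross_covariance_penalty(
    batch: Sequence[Sequence[float]], content_dim: int
) -> float:
    """Sum of squared sample cross-covariances between the two parts.

    A value of 0 means the content and context dimensions are linearly
    uncorrelated over the batch; larger values mean more shared information.
    """
    n = len(batch)
    if n < 2:
        raise ValueError("need at least two vectors to estimate covariance")
    dim = len(batch[0])
    if not 0 < content_dim < dim:
        raise ValueError("content_dim must leave both parts non-empty")

    # Per-dimension means over the batch.
    means = [sum(row[j] for row in batch) / n for j in range(dim)]

    penalty = 0.0
    for i in range(content_dim):            # content dimensions
        for j in range(content_dim, dim):   # context dimensions
            cov = sum(
                (row[i] - means[i]) * (row[j] - means[j]) for row in batch
            ) / (n - 1)
            penalty += cov * cov
    return penalty
```

In a training setup one might add such a penalty to the ASR loss so that the content part carries as little context information as possible. Whether the paper uses this or a different objective cannot be determined from the text available here.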

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing