Content-Context Factorized Representations for Automated Speech Recognition
David M. Chan, Shalini Ghosh

TL;DR
This paper introduces an unsupervised method that separates speech-encoder representations into content and context factors. By reducing spurious correlations between spoken words and the context in which they were spoken, it improves ASR performance, especially in noisy environments.
Contribution
The paper proposes a novel unsupervised, encoder-agnostic approach for factorizing speech representations into content and context, which enhances robustness and generalization in ASR.
Findings
Improved ASR accuracy on standard benchmarks.
Enhanced robustness in noisy conditions.
Effective separation of content and context representations.
Abstract
Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may consist not only of information about the spoken language content, but also may contain information about unnecessary contexts such as background noise and sounds or speaker identity, accent, or protected attributes. Such information can directly harm generalization performance, by introducing spurious correlations between the spoken words and the context in which such words were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved…
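The abstract excerpt above does not say how the factorization is carried out, so the following is only an illustrative sketch of what splitting encoder outputs into a content part and a context part could look like. Everything in it is an assumption made for illustration: the names `factorize`, `make_head`, `content_head`, and `context_head` are hypothetical, the two linear heads are random and untrained, and mean-pooling the context over time reflects a guess that context (noise, speaker) is roughly constant within an utterance. It is not the authors' method.

```python
import random

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def make_head(out_dim, in_dim, rng):
    """Hypothetical linear head with small random weights (untrained)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def factorize(frames, content_head, context_head):
    """Split encoder frames into per-frame content and one pooled context vector.

    Assumption: content varies frame by frame, while context is summarized
    by averaging the context projection over all frames of the utterance.
    """
    content = [matvec(content_head, f) for f in frames]
    per_frame_context = [matvec(context_head, f) for f in frames]
    n = len(per_frame_context)
    context = [sum(col) / n for col in zip(*per_frame_context)]
    return content, context

rng = random.Random(0)
# Stand-in for the output of any speech encoder: 5 frames, 8 features each.
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
content_head = make_head(4, 8, rng)   # 8-dim frame -> 4-dim content
context_head = make_head(3, 8, rng)   # 8-dim frame -> 3-dim context

content, context = factorize(frames, content_head, context_head)
print(len(content), len(content[0]), len(context))
```

Because the heads only consume encoder outputs, a sketch like this does not depend on the encoder architecture, which is one way to read "encoder-agnostic". In a real system the heads would be trained with an unsupervised objective that the excerpt does not describe, and only the content part would be passed to the ASR decoder.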
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
