Improved far-field speech recognition using Joint Variational Autoencoder
Shashi Kumar, Shakti P. Rath, Abhishek Pandey

TL;DR
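This paper introduces a joint Variational Autoencoder (VAE) based feature mapping for far-field speech recognition. In matched training scenarios, it reduces word error rate (WER) by 2.5% absolute compared with a denoising autoencoder (DA) and by 3.96% absolute compared with an acoustic model trained directly on far-field features.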
Contribution
The paper proposes a novel joint VAE-based mapping method that outperforms denoising autoencoders for far-field speech enhancement in matched training conditions.
Findings
2.5% absolute WER reduction over denoising autoencoder (DA) based enhancement
3.96% absolute WER reduction over an acoustic model trained directly on far-field filterbank features
Both improvements are reported for matched-scenario training
Abstract
Automatic Speech Recognition (ASR) systems suffer considerably when the source speech is corrupted by noise or room impulse responses (RIR). Speech enhancement is typically applied in both mismatched and matched training and testing scenarios. In the matched setting, the acoustic model (AM) is trained on dereverberated far-field features, while in the mismatched setting the AM is fixed. Recently, mapping speech features from far-field to close-talk using a denoising autoencoder (DA) has been explored. In this paper, we focus on matched-scenario training and show that the proposed joint VAE-based mapping achieves a significant improvement over DA. Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA-based enhancement and 3.96% compared to an AM trained directly on far-field filterbank features.
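The results above are stated as absolute WER improvements. As a reminder of what that metric measures, here is a minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words) and of the difference between absolute and relative reduction. The function names and the example WER values are hypothetical and chosen for illustration only; they are not taken from the paper, which does not describe its scoring setup here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Difference in WER percentage points."""
    return baseline_wer - improved_wer


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER that was removed."""
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

    # Hypothetical WER values in percent, NOT results from the paper.
    baseline, improved = 20.0, 17.5
    print(absolute_reduction(baseline, improved))
    print(relative_reduction(baseline, improved))
```

An absolute reduction is a difference in percentage points, so its size relative to the baseline depends on the baseline WER itself; the same absolute gain is a larger relative gain when the baseline is lower.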
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Attention Model · Denoising Autoencoder
