Improved far-field speech recognition using Joint Variational Autoencoder

Shashi Kumar; Shakti P. Rath; Abhishek Pandey

arXiv:2204.11286·eess.AS·April 26, 2022

Open Access

TL;DR

This paper introduces a joint Variational Autoencoder (VAE) approach for far-field speech recognition. In the matched training scenario, it lowers word error rate (WER) by 2.5% absolute relative to denoising autoencoder enhancement.

Contribution

The paper proposes a novel joint VAE-based mapping method that outperforms denoising autoencoders for far-field speech enhancement in matched training conditions.

Findings

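01. 2.5% absolute WER reduction over denoising autoencoder (DA) based enhancement

02. 3.96% absolute WER reduction compared to an acoustic model trained directly on far-field filterbank features

03. Both gains are reported for matched-scenario training, where the acoustic model is trained on the enhanced far-field features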

Abstract

Automatic Speech Recognition (ASR) systems suffer considerably when the source speech is corrupted by noise or room impulse responses (RIRs). Typically, speech enhancement is applied in both mismatched and matched training and testing scenarios. In the matched setting, the acoustic model (AM) is trained on dereverberated far-field features, while in the mismatched setting the AM is fixed. Recently, mapping speech features from far-field to close-talk using a denoising autoencoder (DA) has been explored. In this paper, we focus on matched-scenario training and show that the proposed joint variational autoencoder (VAE) based mapping achieves a significant improvement over DA. Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA-based enhancement and 3.96% compared to an AM trained directly on far-field filterbank features.
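The abstract reports its gains as absolute WER reductions. As a reminder of what that metric measures, here is a minimal Python sketch that computes WER from a word-level edit distance and contrasts absolute with relative reduction. The function names, the example sentences, and the baseline and improved WER values (20.0% and 17.5%) are hypothetical illustrations. They are not taken from the paper, which states only the differences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer_pct: float, improved_wer_pct: float) -> float:
    """Difference in percentage points, the form used in the abstract."""
    return baseline_wer_pct - improved_wer_pct


def relative_reduction(baseline_wer_pct: float, improved_wer_pct: float) -> float:
    """Reduction expressed as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer_pct - improved_wer_pct) / baseline_wer_pct


if __name__ == "__main__":
    # Hypothetical transcripts: one deletion ("the") and one substitution ("light" -> "lights").
    print(word_error_rate("turn on the kitchen light", "turn on kitchen lights"))
    # Hypothetical WER values chosen only so that the absolute gap is 2.5 points.
    print(absolute_reduction(20.0, 17.5), relative_reduction(20.0, 17.5))
```

With these made-up values, a 2.5-point absolute reduction corresponds to a 12.5% relative reduction. The relative figure depends on the baseline WER, which the abstract does not state.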

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Attention Model · Denoising Autoencoder