Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

Szu-Jui Chen; Aswin Shanmugam Subramanian; Hainan Xu; Shinji Watanabe

arXiv:1803.10109 · cs.SD · March 28, 2018 · 5 cites

TL;DR

This paper presents a reproducible, state-of-the-art speech recognition system for noisy environments. It combines mask-based generalized eigenvalue beamforming, a TDNN acoustic model trained with LF-MMI, and LSTM language-model rescoring, and reaches 2.74% WER on the real test set, second place in the CHiME-4 challenge.

Contribution

The paper introduces a simplified, high-performing baseline system for the CHiME-4 challenge, combining beamforming, neural network acoustic models, and language models, with publicly available code.

Findings

01

Achieved 2.74% WER on the real test set, corresponding to second place in the CHiME-4 challenge

02

Proposed a speech enhancement pipeline with multiple quality measures

03

Provided a reproducible recipe for noisy speech recognition research
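
The WER figure in the first finding is a word-level edit distance normalized by the reference length. The following is a minimal sketch of that metric for a single utterance, not the scoring script used in the challenge; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. Reported corpus-level WER sums the errors and reference lengths over all utterances before dividing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate of one hypothesis against one reference transcript.

    Hypothetical helper: counts substitutions, deletions, and insertions
    with a standard Levenshtein alignment over whitespace-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.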

Abstract

This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1) a state-of-the-art yet simplified single system comparable to the complicated top systems in the challenge, and 2) a publicly available and reproducible recipe through the main repository of the Kaldi speech recognition toolkit. The proposed system adopts generalized eigenvalue beamforming with bidirectional long short-term memory (LSTM) mask estimation. We also propose to use a time delay neural network (TDNN) based on the lattice-free version of the maximum mutual information (LF-MMI) criterion, trained on data from all six microphones plus the enhanced data after beamforming. Finally, we use an LSTM language model for lattice and n-best re-scoring. The final system achieved 2.74% WER for the real…
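
The generalized eigenvalue (GEV) beamforming step named in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the recipe's implementation: the time-frequency masks are taken as given (in the paper they come from a bidirectional LSTM), the function names `masked_psd`, `gev_weights`, and `apply_beamformer` are hypothetical, the diagonal loading constant is an arbitrary choice, and no post-filter is applied. For each frequency bin, the weights are the principal generalized eigenvector of the speech and noise spatial covariance matrices, which maximizes the output signal-to-noise ratio.

```python
import numpy as np
from scipy.linalg import eigh


def masked_psd(stft, mask):
    """Mask-weighted spatial covariance per frequency bin.

    stft: complex array of shape (F, T, C), C microphone channels.
    mask: real array of shape (F, T) with values in [0, 1].
    Returns a Hermitian array of shape (F, C, C).
    """
    weighted = stft * mask[..., None]
    psd = np.einsum("ftc,ftd->fcd", weighted, stft.conj())
    norm = np.maximum(mask.sum(axis=1), 1e-10)
    return psd / norm[:, None, None]


def gev_weights(psd_speech, psd_noise, eps=1e-6):
    """GEV beamformer weights of shape (F, C).

    eps is an assumed diagonal-loading factor that keeps the noise
    covariance positive definite for the generalized eigensolver.
    """
    num_bins, num_ch, _ = psd_speech.shape
    weights = np.empty((num_bins, num_ch), dtype=complex)
    for f in range(num_bins):
        load = eps * np.trace(psd_noise[f]).real / num_ch
        noise = psd_noise[f] + load * np.eye(num_ch)
        # Solves psd_speech v = lambda * noise v; eigenvalues are ascending,
        # so the last eigenvector maximizes the output SNR.
        _, vecs = eigh(psd_speech[f], noise)
        weights[f] = vecs[:, -1]
    return weights


def apply_beamformer(weights, stft):
    """Apply w^H y per time-frequency bin; returns shape (F, T)."""
    return np.einsum("fc,ftc->ft", weights.conj(), stft)
```

Given a speech mask `m_s` and a noise mask `m_n` for a six-channel STFT `Y`, the enhanced signal would be `apply_beamformer(gev_weights(masked_psd(Y, m_s), masked_psd(Y, m_n)), Y)`. GEV weights are only defined up to a complex scale per frequency bin, so practical systems usually add a normalization step, which this sketch omits.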

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory