Ensemble of Jointly Trained Deep Neural Network-Based Acoustic Models for Reverberant Speech Recognition
Jeehye Lee, Myungin Lee, and Joon-Hyuk Chang

TL;DR
This paper introduces an ensemble of jointly trained deep neural networks for reverberant speech recognition. Each model in the ensemble corresponds to a different RT60 and includes a feature-mapping front-end for dereverberation, and combining these models improves accuracy across diverse reverberation conditions.
Contribution
It proposes a novel ensemble approach with joint training for feature mapping and acoustic modeling, optimized for reverberant environments, and an online model selection method based on RT60 estimation.
Findings
Significant accuracy improvements over baseline systems.
Effective model selection using online RT60 estimation.
Robust performance across various reverberant conditions.
Abstract
Distant speech recognition is a challenge, particularly due to the corruption of speech signals by reverberation caused by large distances between the speaker and microphone. In order to cope with a wide range of reverberations in real-world situations, we present novel approaches for acoustic modeling including an ensemble of deep neural networks (DNNs) and an ensemble of jointly trained DNNs. First, multiple DNNs are established, each of which corresponds to a different reverberation time 60 (RT60) in a setup step. Also, each model in the ensemble of DNN acoustic models is further jointly trained, including both feature mapping and acoustic modeling, where the feature mapping is designed for the dereverberation as a front-end. In a testing phase, the two most likely DNNs are chosen from the DNN ensemble using maximum a posteriori (MAP) probabilities, computed in an online fashion by…
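The selection step described in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it explains how the MAP probabilities are computed, so the sketch takes the per-model posterior probabilities as given input. The linear weighted average used to merge the two selected models is an assumption made here for illustration, not the authors' stated method, and the function names `select_top_two` and `combine_state_posteriors` are hypothetical.

```python
import numpy as np


def select_top_two(rt60_posteriors):
    """Pick the two most probable ensemble members.

    rt60_posteriors: 1-D sequence with one posterior probability per
    RT60-specific model (how these are estimated is not covered here).
    Returns (indices, weights), where weights are the two selected
    posteriors renormalized to sum to 1.
    """
    p = np.asarray(rt60_posteriors, dtype=float)
    if p.ndim != 1 or p.size < 2:
        raise ValueError("need a 1-D array with at least two posteriors")
    top2 = np.argsort(p)[::-1][:2]  # indices sorted by descending posterior
    total = p[top2].sum()
    if total <= 0.0:
        raise ValueError("selected posteriors must have a positive sum")
    return top2, p[top2] / total


def combine_state_posteriors(model_outputs, top2, weights):
    """Merge the outputs of the two selected models.

    model_outputs: array of shape (n_models, n_frames, n_states).
    ASSUMPTION: a linear weighted average; the source text does not
    specify the combination rule.
    """
    outputs = np.asarray(model_outputs, dtype=float)
    return weights[0] * outputs[top2[0]] + weights[1] * outputs[top2[1]]


# Toy example: three RT60-specific models, one frame, two output states.
posteriors = [0.1, 0.6, 0.3]
outputs = np.array([
    [[0.5, 0.5]],
    [[0.9, 0.1]],
    [[0.3, 0.7]],
])
idx, w = select_top_two(posteriors)
combined = combine_state_posteriors(outputs, idx, w)
```

In this toy example the second and third models are selected, and the model with the higher posterior contributes more to the combined output.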
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
