Acoustic Modeling for Overlapping Speech Recognition: JHU CHiME-5 Challenge System
Vimal Manohar, Szu-Jui Chen, Zhiqi Wang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

TL;DR
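This paper presents the Johns Hopkins University acoustic modeling system for overlapping speech recognition in the CHiME-5 challenge. It combines data augmentation, neural network architectures, front-end dereverberation, beamforming, and robust i-vector extraction, and reaches a 69.4% WER on the development set, an 11.7% absolute improvement over the previous 81.1% baseline.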
Contribution
The paper compares in-house implementations with publicly available tools across these components and releases the resulting system as an improved baseline recipe for recognizing highly overlapped dinner party speech.
Findings
Achieved 69.4% WER on the development set
Reduced WER by 11.7% absolute over the previous baseline of 81.1%
Released the improved baseline as an advanced CHiME-5 recipe
Abstract
This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly-overlapped dinner party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming and robust i-vector extraction with comparisons of our in-house implementations and publicly available tools. We finally achieved a word error rate of 69.4% on the development set, which is an 11.7% absolute improvement over the previous baseline of 81.1%, and release this improved baseline with refined techniques/tools as an advanced CHiME-5 recipe.
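The abstract reports the gain as an absolute difference in WER. As a small worked check, the sketch below recomputes that figure from the two reported numbers and also derives the corresponding relative reduction, which the abstract does not state. The function names are illustrative and not taken from the paper or its recipe.

```python
def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in percentage points."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Reduction as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Development-set WERs as reported in the abstract.
baseline = 81.1
improved = 69.4

abs_red = absolute_reduction(baseline, improved)
rel_red = relative_reduction(baseline, improved)

print(f"absolute reduction: {abs_red:.1f} points")  # 11.7 points
print(f"relative reduction: {rel_red:.1f}%")        # 14.4%
```

The distinction matters when comparing systems: 11.7 points is the absolute gap, while the same change amounts to roughly a 14.4% relative reduction from the 81.1% baseline.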
Taxonomy
Topics: Speech Recognition and Synthesis
