Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System

Vimal Manohar; Szu-Jui Chen; Zhiqi Wang; Yusuke Fujita; Shinji Watanabe; Sanjeev Khudanpur

arXiv:2405.11078 · eess.AS · May 21, 2024 · ICASSP


TL;DR

This paper presents an acoustic modeling system for overlapping speech recognition in the CHiME-5 challenge that combines data augmentation, neural network acoustic models, dereverberation, beamforming, and robust i-vector extraction, reducing the development-set word error rate (WER) from 81.1% to 69.4%.

Contribution

It releases an improved baseline, built with refined techniques and tools, as an advanced CHiME-5 recipe for recognizing overlapped dinner-party speech.

Findings

01

Achieved 69.4% WER on the development set

02

Reduced WER by 11.7% absolute over the previous baseline of 81.1%

03

Released the improved system as an advanced CHiME-5 recipe
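The findings above are stated in word error rate (WER). As a minimal sketch, assuming the textbook definition (substitutions + deletions + insertions from a word-level Levenshtein alignment, divided by the number of reference words), the Python below computes WER and the absolute and relative reduction between two systems. The helper names `word_errors`, `wer`, and `improvement` are hypothetical, and the official CHiME-5 scoring (text normalization, per-session handling) is not reproduced here.

```python
from typing import Sequence, Tuple


def word_errors(ref: str, hyp: str) -> int:
    """Minimum substitutions + deletions + insertions turning ref into hyp."""
    r, h = ref.split(), hyp.split()
    # prev[j] holds the edit distance between the previous ref prefix and h[:j]
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            sub = prev[j - 1] + (r[i - 1] != h[j - 1])  # match or substitution
            dele = prev[j] + 1                           # ref word deleted
            ins = cur[j - 1] + 1                         # extra hyp word inserted
            cur[j] = min(sub, dele, ins)
        prev = cur
    return prev[len(h)]


def wer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level WER in percent: total errors / total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


def improvement(baseline: float, system: float) -> Tuple[float, float]:
    """Absolute reduction (WER points) and relative reduction (% of baseline)."""
    absolute = baseline - system
    return absolute, 100.0 * absolute / baseline
```

For the figures reported here, `improvement(81.1, 69.4)` gives about 11.7 points absolute, which works out to roughly a 14.4% relative reduction of the baseline WER.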

Abstract

This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge, which requires recognizing highly overlapped dinner-party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming, and robust i-vector extraction, comparing our in-house implementations with publicly available tools. We achieve a word error rate of 69.4% on the development set, an 11.7% absolute improvement over the previous baseline of 81.1%, and release this improved baseline with refined techniques and tools as an advanced CHiME-5 recipe.
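Beamforming combines the array channels into a single enhanced signal before recognition. The sketch below is a simplified, hypothetical time-domain delay-and-sum beamformer: it estimates an integer lag for each channel by plain cross-correlation against channel 0, then averages the aligned channels. It is a generic illustration of the idea, not the beamformer (or dereverberation front end) evaluated in the paper; practical systems typically estimate delays in the frequency domain (for example with GCC-PHAT), allow fractional delays, and weight channels by quality. The names `estimate_lag` and `delay_and_sum` are illustrative only.

```python
from typing import List, Optional, Sequence


def estimate_lag(ref: Sequence[float], sig: Sequence[float], max_lag: int) -> int:
    """Integer lag d in [-max_lag, max_lag] maximizing sum_t ref[t] * sig[t + d]."""
    best_lag, best_score = 0, float("-inf")
    for d in range(-max_lag, max_lag + 1):
        # plain (unnormalized) cross-correlation at lag d, ignoring out-of-range samples
        score = sum(
            ref[t] * sig[t + d]
            for t in range(len(ref))
            if 0 <= t + d < len(sig)
        )
        if score > best_score:
            best_lag, best_score = d, score
    return best_lag


def delay_and_sum(channels: Sequence[Sequence[float]], max_lag: int = 8,
                  weights: Optional[Sequence[float]] = None) -> List[float]:
    """Align every channel to channel 0, then take a weighted sum (default: mean)."""
    ref = channels[0]
    if weights is None:
        weights = [1.0 / len(channels)] * len(channels)
    lags = [estimate_lag(ref, ch, max_lag) for ch in channels]
    out: List[float] = []
    for t in range(len(ref)):
        acc = 0.0
        for ch, d, w in zip(channels, lags, weights):
            if 0 <= t + d < len(ch):  # samples shifted outside a channel count as zero
                acc += w * ch[t + d]
        out.append(acc)
    return out
```

With a pulse that arrives two samples later in the second channel, the estimated lag is 2 and the aligned average recovers the pulse at its original position. Periodic signals can produce ambiguous lags, which is one reason real implementations restrict and smooth the delay search.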


Code & Models

Repositories

fgnt/nara_wpe (tf, Official)


Taxonomy

Topics: Speech Recognition and Synthesis