The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge

Ruoyu Wang; Maokui He; Jun Du; Hengshun Zhou; Shutong Niu; Hang Chen; Yanyan Yue; Gaobin Yang; Shilong Wu; Lei Sun; Yanhui Tu; Haitao Tang; Shuangqing Qian; Tian Gao; Mengzhi Wang; Genshun Wan; Jia Pan; Jianqing Gao; Chin-Hui Lee

arXiv:2308.14638·eess.AS·October 12, 2023

TL;DR

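This technical report presents a CHiME-7 DASR Challenge submission for speaker diarization and speech recognition in complex multi-speaker scenarios. It combines end-to-end speaker diarization, a rectification strategy based on multi-channel spatial information, and speech recognition models built on publicly available pre-trained models, reaching a macro-averaged DA-WER of 21.01% on the evaluation set.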

Contribution

The authors built an end-to-end speaker diarization system with a rectification strategy based on multi-channel spatial information, and trained end-to-end speech recognition models on top of publicly available pre-trained models.

Findings

01 Achieved a macro-averaged DA-WER of 21.01% on the CHiME-7 evaluation set
02 Improved DA-WER by 62.04% relative to the official baseline system
03 Introduced a rectification strategy based on multi-channel spatial information

Abstract

This technical report details our submission to the CHiME-7 DASR Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. The challenge also evaluates how well systems handle diverse array devices. To address these issues, we implemented an end-to-end speaker diarization system and introduced a rectification strategy based on multi-channel spatial information, which significantly reduced the word error rate (WER). For recognition, we used publicly available pre-trained models as the foundation for training our end-to-end speech recognition models. Our system attained a macro-averaged diarization-attributed WER (DA-WER) of 21.01% on the CHiME-7 evaluation set, a relative improvement of 62.04% over the official baseline system.
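The two headline numbers in the abstract are linked by simple arithmetic, which the sketch below works through. It assumes the usual definition of relative improvement, (baseline - system) / baseline, and an equal-weight mean for the macro average. The function names are hypothetical, and the implied baseline figure is derived from the abstract's numbers rather than stated in it. The per-scenario values in the last call are placeholders for illustration only, not results from the report.

```python
from typing import Mapping


def relative_improvement(baseline_wer: float, system_wer: float) -> float:
    """Relative error reduction of the system over the baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


def implied_baseline(system_wer: float, rel_improvement_pct: float) -> float:
    """Invert the formula above to recover the baseline error rate."""
    if not 0.0 <= rel_improvement_pct < 100.0:
        raise ValueError("rel_improvement_pct must be in [0, 100)")
    return system_wer / (1.0 - rel_improvement_pct / 100.0)


def macro_average(per_scenario_wer: Mapping[str, float]) -> float:
    """Equal-weight mean over scenarios, regardless of how much audio each has."""
    if not per_scenario_wer:
        raise ValueError("need at least one scenario")
    return sum(per_scenario_wer.values()) / len(per_scenario_wer)


# Figures taken from the abstract.
system_da_wer = 21.01   # macro-averaged DA-WER, percent
rel_gain = 62.04        # relative improvement over the baseline, percent

# Derived value (an inference, not a number quoted in the abstract).
baseline_da_wer = implied_baseline(system_da_wer, rel_gain)
print(f"implied baseline DA-WER: {baseline_da_wer:.2f}%")

# Round trip: the derived baseline reproduces the reported relative gain.
print(f"relative improvement: {relative_improvement(baseline_da_wer, system_da_wer):.2f}%")

# Placeholder per-scenario values, purely to show the averaging step.
example = {"scenario_a": 30.0, "scenario_b": 20.0, "scenario_c": 10.0}
print(f"macro average of placeholders: {macro_average(example):.2f}%")
```

Under these assumptions the baseline implied by the abstract works out to roughly 55.35% macro-averaged DA-WER.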

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing