The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge
Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao

TL;DR
This paper presents a multi-channel speech processing system for the CHiME-8 NOTSOFAR-1 Challenge that combines joint training for diarization and separation with an enhanced Whisper-based ASR back end, reaching a tcpWER of 14.265% on multi-channel data and 22.989% on single-channel data.
Contribution
It introduces a data-driven joint training method for diarization and separation, and enhances Whisper ASR with multiple innovations for improved robustness in real-world conditions.
Findings
Achieved a tcpWER of 14.265% on multi-channel data.
Achieved a tcpWER of 22.989% on single-channel data.
Demonstrated the effectiveness of combining guided source separation (GSS) with joint diarization and separation (JDS).
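The tcpWER figures above come from a metric that assigns hypothesis speaker streams to reference speakers before counting word errors. As a rough illustration only, the sketch below computes a simplified permutation-minimized WER. It deliberately omits the time constraint that tcpWER adds, and the function names `edit_distance` and `cp_wer` are hypothetical, not taken from the paper or from the challenge's scoring tools.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cp_wer(refs, hyps):
    """Simplified permutation-minimized WER (no time constraint).

    refs, hyps: lists of per-speaker transcripts, one string per speaker.
    Assumption: brute-force search over permutations, which is only
    practical for a small number of speakers.
    """
    ref_words = [r.split() for r in refs]
    hyp_words = [h.split() for h in hyps]
    # Pad the shorter side with empty streams so speaker counts match.
    n = max(len(ref_words), len(hyp_words))
    ref_words += [[] for _ in range(n - len(ref_words))]
    hyp_words += [[] for _ in range(n - len(hyp_words))]
    total_ref = sum(len(r) for r in ref_words)
    if total_ref == 0:
        raise ValueError("reference contains no words")
    # Pick the speaker assignment with the fewest total word errors.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(ref_words, perm))
        for perm in permutations(hyp_words)
    )
    return best / total_ref


if __name__ == "__main__":
    refs = ["hello world", "good morning everyone"]
    hyps = ["good morning everyone", "hello word"]
    print(f"{cp_wer(refs, hyps):.3f}")
```

In this toy example the hypothesis streams arrive in swapped order, so the best assignment pairs them crosswise and leaves one substitution out of five reference words. A time-constrained variant would additionally forbid matching words whose timestamps are far apart, which this sketch does not attempt.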
Abstract
This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several aspects. For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. We also integrated traditional guided source separation (GSS) for the multi-channel track to provide complementary information for the JDS. For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation, to significantly advance…
Taxonomy
Topics: Seismic Imaging and Inversion Techniques · Geophysics and Gravity Measurements
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · ConvNeXt · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection
