NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge
Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami

TL;DR
This paper presents a multi-speaker distant automatic speech recognition (DASR) system for the CHiME-8 challenge. It combines diarization, speaker counting, source separation, and ASR models, and reaches a macro tcpWER of 21.3% on the dev set, a 57% relative improvement over the baseline.
Contribution
It presents a new multi-channel speaker counting method and a diarization-first DASR pipeline that combines EEND-VC diarization with TS-VAD refinement, guided source separation (GSS), and ASR systems built from pre-trained models.
Findings
Achieved a macro tcpWER of 21.3% on the dev set, a 57% relative improvement over the baseline.
Developed a new multi-channel speaker counting approach.
Applied guided source separation (GSS) with several improvements over the baseline system.
Abstract
We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization-first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach. We then apply guided source separation (GSS) with several improvements to the baseline system. Finally, we perform ASR using a combination of systems built from strong pre-trained models. Our proposed system achieves a macro tcpWER of 21.3% on the dev set, which is a 57% relative improvement over the baseline.
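The two headline numbers in the abstract can be related with a little arithmetic. The sketch below is illustrative only: it assumes that "relative improvement" means (baseline - system) / baseline and that "macro" means an unweighted mean over evaluation scenarios. The function names and the per-scenario values are hypothetical and are not taken from the paper.

```python
from typing import Mapping


def relative_improvement(baseline: float, system: float) -> float:
    """Fraction of the baseline error removed by the system."""
    return (baseline - system) / baseline


def implied_baseline(system: float, rel_improvement: float) -> float:
    """Invert relative_improvement to recover the baseline error."""
    return system / (1.0 - rel_improvement)


def macro_average(per_scenario: Mapping[str, float]) -> float:
    """Unweighted mean over scenarios (assumed meaning of 'macro')."""
    return sum(per_scenario.values()) / len(per_scenario)


# Figures quoted in the abstract.
system_tcpwer = 21.3   # macro tcpWER on the dev set, in percent
reported_gain = 0.57   # 57% relative improvement

# Derived value only: both inputs are rounded, so this is an estimate,
# not a baseline figure reported by the authors.
baseline_estimate = implied_baseline(system_tcpwer, reported_gain)
print(f"Implied baseline macro tcpWER: roughly {baseline_estimate:.1f}%")

# Hypothetical per-scenario scores, used only to show macro-averaging.
hypothetical_scores = {"scenario_a": 30.0, "scenario_b": 20.0, "scenario_c": 10.0}
print(f"Macro average of hypothetical scores: {macro_average(hypothetical_scores):.1f}%")
```

Under these assumptions, the quoted figures imply a baseline macro tcpWER of roughly 49.5%. Because that value is derived from rounded inputs, it should be read as an estimate rather than a reported result.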
Taxonomy
Topics: Advanced Data Processing Techniques · Speech Recognition and Synthesis · Robotics and Automated Systems
