Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge

Naoyuki Kamo; Naohiro Tawara; Atsushi Ando; Takatomo Kano; Hiroshi Sato; Rintaro Ikeshita; Takafumi Moriya; Shota Horiguchi; Kohei Matsuura; Atsunori Ogawa; Alexis Plaquet; Takanori Ashihara; Tsubasa Ochiai; Masato Mimura; Marc Delcroix; Tomohiro Nakatani; Taichi Asami; Shoko Araki

arXiv:2502.09859 · eess.AS · June 23, 2025

Open Access

TL;DR

This paper presents a robust multi-talker distant ASR system for the CHiME-8 challenge that integrates advanced diarization, speech enhancement, and foundation models, achieving a 63% relative macro tcpWER reduction over the baseline across diverse recording conditions.

Contribution

The paper introduces a novel microphone selection rule, strengthens diarization with EEND-VC and TS-VAD, and leverages the Whisper and WavLM models to improve multi-talker distant speech recognition.

Findings

1. Achieved 63% relative macro tcpWER reduction over baseline.
2. Outperformed existing geometry-independent systems on NOTSOFAR-1 data.
3. Demonstrated robustness across various multi-talker recording scenarios.
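
To make the headline metric concrete, the sketch below shows one way a relative reduction in a macro-averaged error rate can be computed. It assumes "macro" means an unweighted mean over recording scenarios, so each scenario counts equally regardless of size. The scenario names and all numbers are hypothetical placeholders chosen for easy arithmetic; they are not results from the paper.

```python
def macro_average(per_scenario_error):
    """Unweighted mean of per-scenario error rates (each scenario counts equally)."""
    values = list(per_scenario_error.values())
    return sum(values) / len(values)


def relative_reduction(baseline_error, system_error):
    """Fraction of the baseline error that the system removes."""
    return (baseline_error - system_error) / baseline_error


# Hypothetical per-scenario error rates, for illustration only.
baseline = {"scenario_a": 0.60, "scenario_b": 0.50, "scenario_c": 0.40, "scenario_d": 0.50}
system = {"scenario_a": 0.30, "scenario_b": 0.25, "scenario_c": 0.20, "scenario_d": 0.25}

baseline_macro = macro_average(baseline)
system_macro = macro_average(system)
reduction = relative_reduction(baseline_macro, system_macro)

print(f"baseline macro error: {baseline_macro:.3f}")
print(f"system macro error:   {system_macro:.3f}")
print(f"relative reduction:   {reduction:.1%}")
```

Note that the reduction is taken on the macro averages, not averaged per scenario; the two can differ when scenarios have very different baseline error rates.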

Abstract

In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task 1 of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles various recording conditions, from dinner parties to professional meetings and from two to eight speakers. As in the challenge baseline, we perform diarization first, followed by speech enhancement, and then ASR. However, we introduced several key refinements. First, we derived a powerful speaker diarization relying on end-to-end speaker diarization with vector clustering (EEND-VC), multi-channel speaker counting using enhanced embeddings from EEND-VC, and target-speaker voice activity detection (TS-VAD). For speech enhancement, we introduced a novel microphone selection rule to better select the most relevant microphones among the distributed microphones and investigated…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing