Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

Samuele Cornell; Christoph Boeddeker; Taejin Park; He Huang; Desh Raj; Matthew Wiesner; Yoshiki Masuyama; Xuankai Chang; Zhong-Qiu Wang; Stefano Squartini; Paola Garcia; Shinji Watanabe

arXiv:2507.18161·eess.AS·November 4, 2025

Open Access

TL;DR

This review analyzes recent advances in distant conversational speech recognition from the CHiME-7 and 8 challenges, highlighting trends in end-to-end systems, neural speech separation, diarization, and the impact of large language models.

Contribution

It provides a comprehensive overview of the design, evaluation, and key trends in DASR challenges, emphasizing the shift to end-to-end models and the importance of diarization refinement.

Findings

1. Most systems now use end-to-end ASR, replacing hybrid models.

2. Neural speech separation techniques are still unreliable in complex scenarios.

3. Diarization refinement and accurate speaker counting are crucial for performance.

Abstract

The CHiME-7 and 8 distant automatic speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques