Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Xinlu He; Jacob Whitehill

arXiv:2505.10975 · cs.CL · January 15, 2026

Open Access

TL;DR

This survey reviews recent end-to-end neural approaches for monaural multi-speaker speech recognition, analyzing architectures, improvements, and challenges to guide future research in this complex field.

Contribution

It provides a comprehensive taxonomy and comparative analysis of recent E2E multi-speaker ASR methods, highlighting key advancements and open challenges.

Findings

01. Analysis of SIMO vs. SISO architectures

02. Comparison of recent algorithmic improvements

03. Evaluation on standard benchmarks

Abstract

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing