Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
Xinlu He, Jacob Whitehill

TL;DR
This survey reviews recent end-to-end neural approaches for monaural multi-speaker speech recognition, analyzing architectures, improvements, and challenges to guide future research in this complex field.
Contribution
It provides a comprehensive taxonomy and comparative analysis of recent E2E multi-speaker ASR methods, highlighting key advancements and open challenges.
Findings
Analysis of SIMO vs. SISO architectures
Comparison of recent algorithmic improvements
Evaluation on standard benchmarks
Abstract
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
