A Review of Speaker Diarization: Recent Advances with Deep Learning
Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe, Shrikanth Narayanan

TL;DR
This paper reviews the evolution and recent advances in speaker diarization, emphasizing deep learning methods and their integration with speech recognition to improve identification of who spoke when in audio recordings.
Contribution
It provides a comprehensive survey of neural speaker diarization techniques and discusses their integration with speech recognition systems, highlighting recent technological trends.
Findings
Deep learning has revolutionized speaker diarization methods.
Neural approaches outperform traditional algorithms in accuracy.
Integrated speaker diarization and speech recognition systems are emerging.
Abstract
Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application over time to provide speaker-specific metainformation for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made for speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches.…
