USED: Universal Speaker Extraction and Diarization
Junyi Ao, Mehmet Sinan Yıldırım, Ruijie Tao, Meng Ge, Shuai Wang, Yanmin Qian, Haizhou Li

TL;DR
USED is a unified model that jointly performs speaker extraction and diarization. It handles overlapping speech and a varying number of speakers, and reports better results than baseline methods on LibriMix, SparseLibriMix, and CALLHOME.
Contribution
The paper introduces USED, a novel unified model that integrates speaker extraction and diarization to improve accuracy and consistency in real-world speech applications.
Findings
Outperforms baseline methods on the LibriMix and SparseLibriMix datasets.
Achieves better diarization results on the CALLHOME dataset.
Effectively manages speech mixtures with varying overlap ratios.
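To make "overlap ratio" concrete, here is a minimal sketch that computes it from frame-level speaker activity labels. The definition used (frames with two or more active speakers divided by frames with at least one active speaker) is an assumption for illustration and may differ from how the datasets above define it. The function name `overlap_ratio` and the toy labels are hypothetical.

```python
def overlap_ratio(activity):
    """Fraction of speech frames in which two or more speakers are active.

    activity: list of per-speaker lists of 0/1 frame labels, all the same length.
    Assumed definition for illustration only.
    """
    n_frames = len(activity[0])
    # Number of active speakers in each frame.
    counts = [sum(spk[t] for spk in activity) for t in range(n_frames)]
    speech = sum(1 for c in counts if c >= 1)
    overlap = sum(1 for c in counts if c >= 2)
    return overlap / speech if speech else 0.0


# Toy example: two speakers, six frames, one overlapping frame.
activity = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
]
ratio = overlap_ratio(activity)
print(ratio)
```

Here five frames contain speech and one of them contains both speakers, so the ratio under this definition is 1/5.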
Abstract
Speaker extraction and diarization are two enabling techniques for real-world speech applications. Speaker extraction aims to extract a target speaker's voice from a speech mixture, while speaker diarization demarcates speech segments by speaker, annotating "who spoke when". Previous studies have typically treated the two tasks independently. In practical applications, it is more meaningful to have knowledge about "who spoke what and when", which is captured by the two tasks. The two tasks share a similar objective of disentangling speakers. Speaker extraction operates in the frequency domain, whereas diarization is in the temporal domain. It is logical to believe that speaker activities obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker activity detection than the speech mixture. In this paper, we propose a unified…
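The link between the two tasks described in the abstract can be illustrated with a small sketch. It is not the USED model, which is a learned neural network. It only shows, under simple assumptions, how frame-level speaker activity ("who spoke when") maps to time segments and how that activity could be used to silence an extracted waveform where the speaker is inactive. The function names, the 10 ms default frame shift, and the toy data are all hypothetical.

```python
def activity_to_segments(frames, frame_shift=0.01):
    """Convert 0/1 frame labels for one speaker into (start, end) times in seconds."""
    segments = []
    start = None
    for i, active in enumerate(frames):
        if active and start is None:
            start = i  # a speech segment begins
        elif not active and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:  # segment runs to the end of the recording
        segments.append((start * frame_shift, len(frames) * frame_shift))
    return segments


def gate_by_activity(samples, frames, samples_per_frame):
    """Zero out waveform samples that fall in frames labelled inactive."""
    last = len(frames) - 1
    return [
        s if frames[min(i // samples_per_frame, last)] else 0.0
        for i, s in enumerate(samples)
    ]


# Toy example: five frames of activity, two samples per frame.
frames = [0, 1, 1, 0, 1]
segments = activity_to_segments(frames, frame_shift=1.0)
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
gated = gate_by_activity(samples, frames, samples_per_frame=2)
print(segments)
print(gated)
```

Hard gating like this is the crudest possible use of diarization output. A joint model can instead share learned representations between the two tasks, which is the direction the abstract argues for.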
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
