Low-Latency Speech Separation Guided Diarization for Telephone Conversations
Giovanni Morrone, Samuele Cornell, Desh Raj, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini

TL;DR
This paper evaluates low-latency speech separation guided diarization (SSGD) for telephone conversations, demonstrating competitive diarization error rates and speech recognition performance with less data and lower latency than state-of-the-art methods.
Contribution
It introduces a low-latency online SSGD model with a novel post-processing algorithm, achieving high diarization accuracy and effective speech recognition integration.
Findings
DPRNN-based online SSGD achieves 11.1% DER on CALLHOME
Post-processing reduces false alarms significantly
Separated signals enable near-oracle speech recognition performance
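The findings above are stated in terms of diarization error rate (DER) and false alarms. As a rough illustration of how these relate, the sketch below computes a frame-level DER as (missed speech + false alarm + speaker confusion) divided by total reference speech, taking the best speaker permutation. It is a simplified sketch, not the scoring tool used in the paper: the function name `frame_der` and the toy activity matrices are hypothetical, both inputs are assumed to have the same number of speakers, and scoring conventions such as forgiveness collars are ignored.

```python
from itertools import permutations

import numpy as np


def frame_der(ref, hyp):
    """Frame-level DER for boolean activity matrices of shape (n_speakers, n_frames).

    Assumes ref and hyp have the same number of speakers (a simplification).
    """
    ref = np.asarray(ref, dtype=bool)
    hyp = np.asarray(hyp, dtype=bool)
    n_ref = ref.sum(axis=0)   # reference speakers active in each frame
    total = n_ref.sum()       # total reference speech, in speaker-frames
    best_err = None
    # Hypothesis speaker labels are arbitrary, so try every assignment.
    for perm in permutations(range(hyp.shape[0])):
        h = hyp[list(perm)]
        n_hyp = h.sum(axis=0)
        n_correct = (ref & h).sum(axis=0)
        miss = np.maximum(n_ref - n_hyp, 0).sum()         # missed speech
        false_alarm = np.maximum(n_hyp - n_ref, 0).sum()  # false alarm
        confusion = (np.minimum(n_ref, n_hyp) - n_correct).sum()
        err = miss + false_alarm + confusion
        if best_err is None or err < best_err:
            best_err = err
    return best_err / total


# Toy example: 2 speakers, 10 frames.
ref = np.zeros((2, 10), dtype=bool)
ref[0, 0:5] = True    # speaker A talks in frames 0-4
ref[1, 5:10] = True   # speaker B talks in frames 5-9

hyp = np.zeros((2, 10), dtype=bool)
hyp[0, 5:10] = True   # hypothesis lists speaker B first (labels permuted)
hyp[0, 0] = True      # one false-alarm frame for speaker B
hyp[1, 0:4] = True    # speaker A detected, but frame 4 is missed

der = frame_der(ref, hyp)
print(f"DER = {der:.1%}")
```

In this toy case one false-alarm frame and one missed frame out of ten reference speech frames give a DER of 20%; removing the false-alarm frame alone would halve it, which is the kind of error that post-processing in an SSGD pipeline targets.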
Abstract
In this paper, we carry out an analysis on the use of speech separation guided diarization (SSGD) in telephone conversations. SSGD performs diarization by separating the speakers' signals and then applying voice activity detection on each estimated speaker signal. In particular, we compare two low-latency speech separation models. Moreover, we show a post-processing algorithm that significantly reduces the false alarm errors of an SSGD pipeline. We perform our experiments on two datasets: Fisher Corpus Part 1 and CALLHOME, evaluating both separation and diarization metrics. Notably, our SSGD DPRNN-based online model achieves 11.1% DER on CALLHOME, comparable with most state-of-the-art end-to-end neural diarization models despite being trained on an order of magnitude less data and having considerably lower latency, i.e., 0.1 vs. 10 seconds. We also show that the separated signals can be…
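The second stage of the SSGD pipeline described in the abstract (voice activity detection on each separated signal) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the separation model is assumed to have already produced one waveform per speaker, a simple energy threshold stands in for whatever VAD the paper uses, and the function names, the 8 kHz sample rate, the 20 ms frame length, and the -40 dB threshold are all hypothetical choices.

```python
import numpy as np


def energy_vad(signal, frame_len=160, threshold_db=-40.0):
    """Boolean mask per frame: True where mean power exceeds threshold_db.

    An energy threshold is a stand-in for a real VAD (assumption).
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    power_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return power_db > threshold_db


def frames_to_segments(active, frame_len, sample_rate):
    """Turn a boolean frame mask into a list of (start_s, end_s) segments."""
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]  # rising and falling edges
    hop = frame_len / sample_rate
    return [(float(s * hop), float(e * hop)) for s, e in zip(starts, ends)]


def ssgd_from_separated(separated, sample_rate=8000, frame_len=160,
                        threshold_db=-40.0):
    """Diarize by running VAD independently on each separated speaker signal."""
    return {
        f"spk{i}": frames_to_segments(
            energy_vad(channel, frame_len, threshold_db), frame_len, sample_rate
        )
        for i, channel in enumerate(separated)
    }


# Toy "separated" outputs: speaker 0 talks in the first half second,
# speaker 1 in the second half (a tone stands in for speech).
sr = 8000
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
silence = np.zeros(sr // 2)
separated = [np.concatenate([tone, silence]), np.concatenate([silence, tone])]

segments = ssgd_from_separated(separated, sample_rate=sr)
print(segments)
```

Because each speaker has a dedicated output channel, overlapping speech needs no special handling in this design: two channels can simply be active at the same time. The cost is that any leakage of one speaker into the other channel shows up as false-alarm segments, which is why a post-processing step matters.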
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research
Methods: End-to-End Neural Diarization
