Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models
Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

TL;DR
This paper presents an iterative approach combining target-speaker ASR and speaker diarization for single-channel dialogue recordings, effectively recognizing speech and identifying speakers without prior speaker samples.
Contribution
The paper introduces an iterative method that alternates between estimating speaker embeddings and running TS-ASR with those embeddings, enabling speech recognition and speaker diarization without sample utterances of the speakers collected in advance.
Findings
Significantly reduced word error rate (WER) and diarization error rate (DER).
Achieved near-oracle performance with estimated speaker embeddings.
Outperformed conventional clustering-based diarization methods.
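For readers unfamiliar with the first metric, WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between a reference and a hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` is made up for this example, and it is not the scoring tool or text normalization used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.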
Abstract
This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown because it requires a sample of the target speakers in advance of decoding. To remove this limitation, we propose an iterative method, in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are alternately executed. We evaluated the proposed method by using very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both…
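The alternation described in the abstract can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the two callables `estimate_embeddings` and `run_ts_asr`, the fixed iteration count, and the convention of passing `None` as the hypotheses on the first pass are all hypothetical choices made for illustration. The abstract does not specify how the embeddings are initialized or when the iteration stops.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Embedding = Sequence[float]
Hypothesis = Any  # e.g. a word sequence with timings; left abstract here


def alternate_embedding_and_ts_asr(
    audio: Any,
    estimate_embeddings: Callable[[Any, Optional[List[Hypothesis]]], List[Embedding]],
    run_ts_asr: Callable[[Any, Embedding], Hypothesis],
    num_iterations: int = 3,  # assumption: fixed count, not from the paper
) -> Tuple[List[Embedding], List[Hypothesis]]:
    """Alternate (i) speaker-embedding estimation and (ii) TS-ASR decoding."""
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")

    hypotheses: Optional[List[Hypothesis]] = None
    embeddings: List[Embedding] = []
    for _ in range(num_iterations):
        # (i) Estimate one embedding per speaker. On the first pass there
        # are no recognition results yet, so the callable receives None.
        embeddings = estimate_embeddings(audio, hypotheses)
        # (ii) Decode the same recording once per estimated speaker,
        # conditioning TS-ASR on that speaker's embedding.
        hypotheses = [run_ts_asr(audio, emb) for emb in embeddings]
    return embeddings, hypotheses
```

Passing the two steps in as callables keeps the sketch independent of any particular embedding extractor or acoustic model; each per-speaker hypothesis then carries both what was said and who said it.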
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
