Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

Naoyuki Kanda; Shota Horiguchi; Yusuke Fujita; Yawen Xue; Kenji Nagamatsu; Shinji Watanabe

arXiv:1909.08103·cs.CL·September 19, 2019·6 cites

TL;DR

This paper presents an iterative approach that alternates speaker-embedding estimation with target-speaker ASR (TS-ASR), so that single-channel dialogue recordings can be transcribed and diarized without speaker samples collected in advance.

Contribution

The paper introduces an iterative method that alternately estimates speaker embeddings and applies TS-ASR with those embeddings, enabling speech recognition and speaker diarization when no sample of the speakers is known beforehand.

Findings

01 Significantly reduced word error rate (WER) and diarization error rate (DER).

02 Achieved near-oracle performance with estimated speaker embeddings.

03 Outperformed conventional clustering-based diarization methods.
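
For readers unfamiliar with the first metric above, WER is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a generic illustration of that definition and is not the scoring tool used in the paper. The function name `word_error_rate` and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from the paper): one substitution, one deletion.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

DER is computed differently, from the durations of missed speech, false-alarm speech, and speaker confusion relative to total reference speech time, so it is not covered by this sketch.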

Abstract

This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown because it requires a sample of the target speakers in advance of decoding. To remove this limitation, we propose an iterative method, in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are alternately executed. We evaluated the proposed method by using very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both…
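
The alternation described in the abstract, (i) estimating speaker embeddings and (ii) running TS-ASR with them, can be pictured as a short loop. What follows is a minimal sketch of that control flow only, under loud assumptions: every name here (`alternate_embedding_and_ts_asr`, `toy_embedding`, `toy_ts_asr`) is hypothetical, the "audio" is a list of scalar frames, the "embedding" is a mean, and the toy "TS-ASR" returns only the frame indices it attributes to the target speaker rather than words. None of this reproduces the paper's models or its initialization.

```python
from typing import Callable, Dict, List, Sequence

Frames = List[int]  # indices of frames attributed to one speaker


def alternate_embedding_and_ts_asr(
    audio: Sequence[float],
    initial_frames: Dict[str, Frames],
    extract_embedding: Callable[[Sequence[float], Frames], float],
    run_ts_asr: Callable[[Sequence[float], float], Frames],
    num_iters: int = 2,
) -> Dict[str, Frames]:
    """Alternate embedding estimation and target-speaker decoding."""
    frames = dict(initial_frames)
    for _ in range(num_iters):
        # (i) Estimate one embedding per speaker from its current frames.
        embeddings = {
            spk: extract_embedding(audio, idx) for spk, idx in frames.items()
        }
        # (ii) Decode once per speaker, conditioned on that embedding.
        frames = {spk: run_ts_asr(audio, emb) for spk, emb in embeddings.items()}
    return frames


# Toy stand-ins (assumptions, not the paper's components).
def toy_embedding(audio: Sequence[float], idx: Frames) -> float:
    return sum(audio[i] for i in idx) / len(idx)


def toy_ts_asr(audio: Sequence[float], emb: float) -> Frames:
    # Keep the frames that look like the target speaker.
    return [i for i, x in enumerate(audio) if abs(x - emb) < 1.0]


audio = [0.0, 0.1, 5.0, 5.2, 0.2, 4.9]  # two well-separated toy "speakers"
result = alternate_embedding_and_ts_asr(
    audio, {"A": [0], "B": [2]}, toy_embedding, toy_ts_asr
)
print(result)
```

In this toy setting each speaker starts from a single seed frame, and the per-speaker frame sets stabilize after the first pass. In a real system the decoding step would also return a transcript, and the frames it marks as speech would double as the diarization output.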

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing