Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire

Zhiyun Fan; Zhenlin Liang; Linhao Dong; Yi Liu; Shiyu Zhou; Meng Cai; Jun Zhang; Zejun Ma; Bo Xu

arXiv:2211.09381 · cs.SD · November 18, 2022

Open Access

TL;DR

This paper introduces a token-level speaker change detection method that combines speaker difference and speech content cues through the continuous integrate-and-fire mechanism, and it outperforms a competitive frame-level baseline on AISHELL-4 by 2.45% ECP.

Contribution

It proposes a new speaker change detection (SCD) approach that integrates speech content and speaker difference cues at the token level, using continuous integrate-and-fire (CIF) to detect speaker changes on token acoustic boundaries.

Findings

01. Outperforms a competitive frame-level baseline by 2.45% ECP.

02. Highlights the importance of speech content in SCD.

03. Demonstrates the advantages of token-level detection over frame-level detection.

Abstract

In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to segment the audio and then transcribe each segmentation. These two stages are addressed separately by speaker change detection (SCD) and automatic speech recognition (ASR). Most previous SCD systems rely solely on speaker information and ignore the importance of speech content. In this paper, we propose a novel SCD system that considers both cues of speaker difference and speech content. These two cues are converted into token-level representations by the continuous integrate-and-fire (CIF) mechanism and then combined for detecting speaker changes on the token acoustic boundaries. We evaluate the performance of our approach on a public real-recorded meeting dataset, AISHELL-4. The experiment results show that our method outperforms a competitive frame-level baseline system by…
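To make the CIF step in the abstract concrete, here is a minimal pure-Python sketch of integrate-and-fire over frame-level features. It is an illustration under stated assumptions, not the authors' implementation: the function name `cif_integrate`, the firing threshold of 1.0, the toy one-dimensional features, the sharing of one weight sequence across both cues, and the use of concatenation to "combine" them are all assumptions made for this example.

```python
from typing import List, Tuple

Vector = List[float]


def cif_integrate(
    frames: List[Vector], alphas: List[float], threshold: float = 1.0
) -> Tuple[List[Vector], List[int]]:
    """Accumulate weighted frames and fire one token each time the
    accumulated weight reaches `threshold`.

    Assumes every weight lies in [0, threshold], so a frame can trigger
    at most one firing. Returns the fired token vectors and the frame
    index at which each token boundary was located.
    """
    if len(frames) != len(alphas):
        raise ValueError("frames and alphas must have the same length")

    dim = len(frames[0]) if frames else 0
    tokens: List[Vector] = []
    boundaries: List[int] = []
    acc = 0.0                 # weight accumulated since the last firing
    state = [0.0] * dim       # weighted sum of frames since the last firing

    for t, (h, a) in enumerate(zip(frames, alphas)):
        if acc + a < threshold:
            # Keep integrating: no boundary in this frame.
            acc += a
            state = [s + a * x for s, x in zip(state, h)]
        else:
            # Split the weight: one part completes the current token,
            # the remainder starts the next one.
            a_fire = threshold - acc
            tokens.append([s + a_fire * x for s, x in zip(state, h)])
            boundaries.append(t)
            acc = a - a_fire
            state = [acc * x for x in h]

    # Any leftover weight below the threshold is discarded in this sketch.
    return tokens, boundaries


# Toy inputs (hypothetical): 5 frames, 1-D features per cue.
alphas = [0.5, 0.5, 0.25, 0.75, 0.5]
content_frames = [[1.0], [3.0], [2.0], [4.0], [8.0]]
speaker_frames = [[0.0], [0.0], [1.0], [1.0], [1.0]]

content_tokens, boundaries = cif_integrate(content_frames, alphas)
speaker_tokens, _ = cif_integrate(speaker_frames, alphas)

# Assumption: combine the two token-level cues by concatenation.
combined = [c + s for c, s in zip(content_tokens, speaker_tokens)]
print(boundaries)  # [1, 3]
print(combined)    # [[2.0, 0.0], [3.5, 1.0]]
```

Because both streams are integrated with the same weights, they fire at the same frames, so each combined vector describes one token and a change decision can be made per token boundary rather than per frame.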

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing