Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information

Zhihao Du; Shiliang Zhang; Siqi Zheng; Weilong Huang; Ming Lei

arXiv:2111.13694 · cs.SD · November 30, 2021 · 1 citation

TL;DR

This paper introduces SEND, a neural diarization method that encodes multi-speaker labels with a power set, so that overlapping speech diarization becomes a single-label prediction problem. Incorporating textual information further reduces diarization errors, and in the real meeting scenario the method achieves a 34.11% relative improvement over traditional clustering.

Contribution

The paper proposes speaker embedding-aware neural diarization (SEND), which reformulates multi-label prediction as single-label prediction through power set encoding, and integrates textual information to improve diarization accuracy.

Findings

01. SEND achieves a lower diarization error rate than target-speaker voice activity detection (TS-VAD).

02. Incorporating textual information further reduces diarization errors.

03. In the real meeting scenario, SEND achieves a 34.11% relative improvement over traditional clustering.

Abstract

Overlapping speech diarization is always treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels according to the similarities between speech features and given speaker embeddings. Our method is further extended and integrated with downstream tasks by utilizing the textual information, which has not been well studied in previous literature. The experimental results show that our method achieves lower diarization error rate than the target-speaker voice activity detection. When textual information is involved, the diarization errors can be further reduced. For the real meeting scenario, our method can achieve 34.11% relative improvement…
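To make the reformulation concrete, the sketch below shows one plausible way to turn per-frame multi-speaker labels into a single class index. It assumes each frame carries a binary activity vector (one bit per speaker) and reads that vector as a binary number, so N speakers give 2^N classes. The function names `powerset_encode` and `powerset_decode` and the toy frames are illustrative assumptions, not taken from the paper, which may order or restrict the classes differently.

```python
from typing import List


def powerset_encode(active: List[int]) -> int:
    """Map a binary speaker-activity vector to a single class index.

    Assumption: speaker i contributes 2**i when active, so every subset
    of speakers (including silence and overlaps) gets a unique label.
    """
    return sum(bit << i for i, bit in enumerate(active))


def powerset_decode(label: int, num_speakers: int) -> List[int]:
    """Recover the binary speaker-activity vector from a class index."""
    return [(label >> i) & 1 for i in range(num_speakers)]


# Toy example with 3 speakers: silence, speaker 0 alone,
# speakers 0 and 1 overlapping, speaker 2 alone.
frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
labels = [powerset_encode(f) for f in frames]
print(labels)  # [0, 1, 3, 4]
```

With labels of this form, a model can be trained with an ordinary single-label classification loss over the class indices, and `powerset_decode` maps each predicted class back to per-speaker activity, including overlapped frames.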

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing