End-to-end speaker diarization with transformer

Yongquan Lai; Xin Tang; Yuanyuan Fu; Rui Fang

arXiv:2112.07463·cs.SD·December 15, 2021·1 cites

Open Access

TL;DR

This paper introduces DiFormer, an end-to-end transformer-based model for speaker diarization that predicts speaker masks, vocal activities, and speaker vectors simultaneously, handling overlaps and unknown speaker counts.

Contribution

DiFormer applies transformer-based set prediction to speaker diarization, combining multi-scale speaker features with temporal modeling of the speech signal.

Findings

01 Effective handling of overlapping speech and an unknown number of speakers

02 Improved diarization accuracy on multimedia and meeting datasets

03 Unified approach for speaker segmentation and vocal activity detection

Abstract

Speaker diarization is connected to semantic segmentation in computer vision. Inspired by MaskFormer \cite{cheng2021per}, which treats semantic segmentation as a set-prediction problem, we propose an end-to-end approach that predicts a set of targets consisting of binary masks, vocal activities and speaker vectors. Our model, which we coin \textit{DiFormer}, is mainly based on a speaker encoder and a feature pyramid network (FPN) module that extract multi-scale speaker features, which are then fed into a transformer encoder-decoder to predict a set of diarization targets from learned query embeddings. To account for the temporal characteristics of speech signals, bidirectional LSTMs are inserted into the mask prediction module to improve temporal consistency. Our model handles an unknown number of speakers, speech overlaps, and vocal activity detection in a unified way. Experiments on…
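
To make the set-prediction output more concrete, the following is a minimal sketch of how per-query binary masks and vocal activities could be decoded into speaker segments. It is not the authors' code: the function name `decode_diarization`, the two thresholds, and the `frame_shift` parameter are all hypothetical, and the sketch assumes the model emits one mask probability per frame and one vocal-activity probability per query.

```python
from typing import Dict, List, Tuple


def decode_diarization(
    mask_probs: List[List[float]],   # assumed shape: [num_queries][num_frames]
    activity_probs: List[float],     # assumed shape: [num_queries]
    frame_shift: float = 0.01,       # hypothetical frame hop in seconds
    activity_threshold: float = 0.5, # hypothetical cut-off for keeping a query
    mask_threshold: float = 0.5,     # hypothetical cut-off for binarizing masks
) -> Dict[int, List[Tuple[float, float]]]:
    """Turn per-query predictions into (start, end) segments in seconds."""
    segments: Dict[int, List[Tuple[float, float]]] = {}
    for query, (mask, activity) in enumerate(zip(mask_probs, activity_probs)):
        # Queries with low vocal activity are treated as "no speaker", so the
        # number of speakers does not have to be known in advance.
        if activity < activity_threshold:
            continue
        spans: List[Tuple[float, float]] = []
        start = None
        for frame, prob in enumerate(mask):
            active = prob >= mask_threshold
            if active and start is None:
                start = frame  # a speech run begins
            elif not active and start is not None:
                spans.append((start * frame_shift, frame * frame_shift))
                start = None   # the speech run ends
        if start is not None:  # run extends to the last frame
            spans.append((start * frame_shift, len(mask) * frame_shift))
        if spans:
            segments[query] = spans
    return segments


if __name__ == "__main__":
    masks = [
        [0.9, 0.9, 0.1, 0.1],  # query 0 speaks in frames 0-1
        [0.1, 0.8, 0.8, 0.1],  # query 1 speaks in frames 1-2 (overlaps query 0)
        [0.9, 0.9, 0.9, 0.9],  # query 2 has a full mask but low activity
    ]
    print(decode_diarization(masks, [0.9, 0.8, 0.1], frame_shift=1.0))
```

Because each query carries its own mask, two speakers can be active in the same frame, which is how overlapping speech falls out of this kind of decoding without a separate overlap detector.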

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing