Joint speaker diarisation and tracking in switching state-space model

Jeremy H. M. Wong; Yifan Gong

arXiv:2109.11140·cs.SD·September 24, 2021

Open Access

TL;DR

This paper introduces a unified state-space model with particle filtering to jointly track speaker movements and perform diarisation, effectively handling moving speakers in meetings.

Contribution

The paper proposes a joint model that combines speaker diarisation with movement tracking, relaxing the assumption in previous methods that speakers stay fairly stationary.

Findings

01

Performs comparably with methods that use location information.

02

Effectively tracks moving speakers during meetings.

03

Implements the model as a particle filter.

Abstract

Speakers may move around while diarisation is being performed. When a microphone array is used, the instantaneous locations of where the sounds originated from can be estimated, and previous investigations have shown that such information can be complementary to speaker embeddings in the diarisation task. However, these approaches often assume that speakers are fairly stationary throughout a meeting. This paper relaxes this assumption, by proposing to explicitly track the movements of speakers while jointly performing diarisation within a unified model. A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers. The model is implemented as a particle filter. Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able…
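The abstract describes a hidden state that holds the identity of the active speaker together with the locations of all speakers, tracked with a particle filter. The sketch below illustrates that general idea only; it is not the authors' implementation. Everything in it is an assumption made for illustration: locations are one-dimensional (for example an arrival angle in degrees), each speaker moves as a Gaussian random walk, the active speaker persists with a fixed probability `p_stay`, the observation is a single noisy location, and speaker embeddings are left out entirely. The function names and default parameter values are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def init_particles(n_particles, init_locs, rng, init_std=5.0):
    """Each particle holds an active-speaker index and one location per speaker."""
    k = len(init_locs)
    return [
        {"active": rng.randrange(k),
         "locs": [rng.gauss(m, init_std) for m in init_locs]}
        for _ in range(n_particles)
    ]


def step(particles, obs, rng, p_stay=0.9, move_std=1.0, obs_std=3.0):
    """One predict / weight / resample cycle for a single observed location."""
    k = len(particles[0]["locs"])
    proposed, weights = [], []
    for p in particles:
        # Discrete switch (assumed): keep the active speaker or jump to another.
        active = p["active"]
        if k > 1 and rng.random() > p_stay:
            active = rng.choice([j for j in range(k) if j != active])
        # Continuous dynamics (assumed): every speaker follows a random walk.
        locs = [x + rng.gauss(0.0, move_std) for x in p["locs"]]
        proposed.append({"active": active, "locs": locs})
        # Only the active speaker's location explains the observation.
        # The tiny constant keeps the total weight strictly positive.
        weights.append(gaussian_pdf(obs, locs[active], obs_std) + 1e-300)
    # Multinomial resampling in proportion to the weights.
    chosen = rng.choices(proposed, weights=weights, k=len(proposed))
    return [{"active": c["active"], "locs": list(c["locs"])} for c in chosen]


def summarise(particles):
    """Most frequent active speaker and mean location per speaker."""
    k = len(particles[0]["locs"])
    counts = [0] * k
    for p in particles:
        counts[p["active"]] += 1
    means = [sum(p["locs"][j] for p in particles) / len(particles)
             for j in range(k)]
    return counts.index(max(counts)), means
```

To try it, create particles with `init_particles(500, [30.0, 120.0], random.Random(0))`, call `step` once per observed location, and read the diarisation decision and the tracked locations from `summarise`. Because each particle carries both the discrete speaker label and the continuous locations, a single resampling step updates the two jointly, which is the property the abstract emphasises.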

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing