Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks

Tae Jin Park; Panayiotis Georgiou

arXiv:1805.10731·eess.AS·May 29, 2018

TL;DR

This paper presents a multimodal speaker diarization system that combines lexical and acoustic cues in a sequence-to-sequence neural network, improving on lexical-only and acoustic-only systems.

Contribution

It introduces a joint lexical-acoustic neural model with a new loss function for better speaker change detection and grouping.

Findings

01. The multimodal approach outperforms lexical-only and acoustic-only systems.

02. The proposed method reduces Diarization Error Rate (DER) compared to the baseline.

03. The advantage persists with ASR transcripts, although performance drops somewhat.
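
For readers unfamiliar with the metric, DER is commonly defined as the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The sketch below is a simplified frame-level illustration of that definition, not the scoring procedure used in the paper: the function name `frame_der`, the per-frame label lists, and the use of `None` for non-speech are assumptions made for the example, and standard scoring tools work on time intervals, often with a forgiveness collar.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified frame-level DER (illustrative; hypothetical helper).

    reference, hypothesis: equal-length lists of speaker labels per frame,
    with None marking non-speech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    total_speech = sum(r is not None for r in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Hypothesis labels are arbitrary, so search for the one-to-one mapping
    # onto reference speakers that yields the fewest confused frames.
    ref_spk = sorted({r for r in reference if r is not None})
    hyp_spk = sorted({h for h in hypothesis if h is not None})
    n = max(len(ref_spk), len(hyp_spk))
    ref_pad = ref_spk + [("_pad_ref", i) for i in range(n - len(ref_spk))]
    hyp_pad = hyp_spk + [("_pad_hyp", i) for i in range(n - len(hyp_spk))]

    confusion = None
    for perm in permutations(ref_pad):
        mapping = dict(zip(hyp_pad, perm))
        errors = sum(
            r is not None and h is not None and mapping[h] != r
            for r, h in pairs
        )
        if confusion is None or errors < confusion:
            confusion = errors

    return (missed + false_alarm + confusion) / total_speech


reference = ["A", "A", "A", "B", "B", None]
hypothesis = ["x", "x", "y", "y", "y", "y"]
print(frame_der(reference, hypothesis))  # one confused frame + one false alarm over 5 speech frames
```

The exhaustive search over mappings is only practical for a handful of speakers, which is enough for two-party conversations such as those in Fisher.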

Abstract

While there has been a substantial amount of work in speaker diarization recently, there have been few efforts to jointly employ lexical and acoustic information for speaker segmentation. Towards that, we investigate a speaker diarization system using a sequence-to-sequence neural network trained on both lexical and acoustic features. We also propose a loss function that allows for selecting not only the speaker change points but also the best speaker at any time by allowing for different speaker groupings. We incorporate Mel Frequency Cepstral Coefficients (MFCC) as an acoustic feature alongside lexical information obtained from conversations in the Fisher dataset. Thus, we show that acoustics provide complementary information to the lexical modality. The experimental results show that a sequence-to-sequence system trained on both word sequences and MFCC can improve on speaker…
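
The abstract does not spell out the proposed loss function. One way to read "allowing for different speaker groupings" is that the loss should not penalize a prediction merely because its speaker labels are swapped relative to the reference. The sketch below illustrates that idea under stated assumptions: it takes a per-word cross-entropy and keeps the minimum over all relabelings of the speakers. The function name `grouping_invariant_loss` and the per-word probability format are hypothetical, and this should not be taken as the authors' formulation.

```python
import math
from itertools import permutations


def grouping_invariant_loss(probs, labels, num_speakers=2):
    """Mean negative log-likelihood, minimized over speaker relabelings.

    probs:  probs[t][k] is the predicted probability that output slot k
            is speaking at word t (assumed format, for illustration).
    labels: reference speaker index for each word.
    Returns (loss, best_perm), where best_perm[k] is the output slot
    matched to reference speaker k.
    """
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(num_speakers)):
        # Clamp probabilities to avoid log(0).
        nll = -sum(
            math.log(max(p[perm[y]], 1e-12)) for p, y in zip(probs, labels)
        ) / len(labels)
        if nll < best_loss:
            best_loss, best_perm = nll, perm
    return best_loss, best_perm


# The prediction groups the words correctly but uses swapped speaker labels.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]
labels = [1, 1, 0]
loss, perm = grouping_invariant_loss(probs, labels)
print(loss, perm)
```

In this example the swapped assignment gives the lower loss, so the prediction is scored as if its grouping were correct; a plain cross-entropy against the reference labels would treat every word as an error.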
