Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks
Tae Jin Park, Panayiotis Georgiou

TL;DR
This paper presents a novel multimodal speaker diarization system that combines lexical and acoustic cues using sequence-to-sequence neural networks, improving accuracy over traditional methods.
Contribution
It introduces a joint lexical-acoustic neural model with a new loss function for better speaker change detection and grouping.
Findings
Multimodal approach outperforms lexical-only and acoustic-only systems.
Proposed method reduces Diarization Error Rate compared to baseline.
Performance remains superior even when ASR transcripts are used, although with some degradation.
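Diarization Error Rate (DER) is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The following is a minimal frame-level sketch of that definition, not the scoring setup used in the paper: the function name `frame_der`, the use of `None` for non-speech frames, and the brute-force speaker mapping are all assumptions made for illustration, and collars and overlapped speech are ignored.

```python
import itertools


def frame_der(reference, hypothesis):
    """Simplified frame-level DER (illustrative; hypothetical helper).

    reference, hypothesis: equal-length lists of speaker ids (strings),
    with None marking a non-speech frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    ref_speech = sum(r is not None for r in reference)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Frames where both sides claim speech; these can be correct or confused.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]
    ref_ids = sorted({r for r, _ in both})
    hyp_ids = sorted({h for _, h in both})

    # Speaker labels are arbitrary, so search for the one-to-one mapping
    # from hypothesis ids to reference ids that maximizes agreement.
    # Brute force is only practical for a handful of speakers.
    size = max(len(ref_ids), len(hyp_ids))
    padded_ref = ref_ids + [None] * (size - len(ref_ids))
    best_correct = 0
    for perm in itertools.permutations(padded_ref, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / ref_speech


# Usage: one confused frame and one false-alarm frame over four speech frames.
ref = ["A", "A", "B", "B", None]
hyp = ["x", "x", "x", "y", "y"]
der = frame_der(ref, hyp)
```

Production scoring works on time-stamped segments rather than fixed frames, so this sketch is only meant to make the three error terms concrete.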
Abstract
While there has been a substantial amount of work in speaker diarization recently, there are few efforts in jointly employing lexical and acoustic information for speaker segmentation. Towards that, we investigate a speaker diarization system using a sequence-to-sequence neural network trained on both lexical and acoustic features. We also propose a loss function that allows for selecting not only the speaker change points but also the best speaker at any time by allowing for different speaker groupings. We incorporate Mel Frequency Cepstral Coefficients (MFCC) as an acoustic feature alongside lexical information that is obtained from conversations from the Fisher dataset. Thus, we show that acoustics provide complementary information to the lexical modality. The experimental results show that a sequence-to-sequence system trained on both word sequences and MFCC can improve on speaker…
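The idea of "allowing for different speaker groupings" can be illustrated with a generic permutation-invariant loss: a labeling such as `1 1 0` describes the same segmentation as `0 0 1`, so the loss should not depend on which index each speaker receives. The sketch below is an assumption-laden illustration of that general technique, not the loss function proposed in the paper; the names `grouping_invariant_nll` and `change_points`, and the per-word probability format, are hypothetical.

```python
import itertools
import math


def grouping_invariant_nll(probs, labels):
    """Negative log-likelihood minimized over speaker relabelings.

    probs:  one distribution per word over speaker slots, e.g. [[0.9, 0.1], ...]
            (assumed strictly positive so that log() is defined).
    labels: one integer speaker id per word, in range(len(probs[0])).
    Returns (best_loss, best_permutation), where best_permutation[y] is the
    slot assigned to reference speaker y.
    """
    num_speakers = len(probs[0])
    best_loss, best_perm = math.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        loss = -sum(math.log(p[perm[y]]) for p, y in zip(probs, labels))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def change_points(labels):
    """Indices of words whose speaker differs from the previous word."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Usage: two labelings of the same grouping give the same loss and
# the same speaker change points.
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
loss_a, perm_a = grouping_invariant_nll(probs, [0, 0, 1])
loss_b, perm_b = grouping_invariant_nll(probs, [1, 1, 0])
```

Enumerating permutations is only feasible for a small number of speakers, which fits two-party conversations such as those in the Fisher dataset.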
