The Speed Submission to DIHARD II: Contributions & Lessons Learned

Md Sahidullah; Jose Patino; Samuele Cornell; Ruiqing Yin; Sunit Sivasankaran; Hervé Bredin; Pavel Korshunov; Alessio Brutti; Romain Serizel; Emmanuel Vincent; Nicholas Evans; Sébastien Marcel; Stefano Squartini; Claude Barras

arXiv:1911.02388 · eess.AS · November 7, 2019

Open Access

TL;DR

This paper details the Speed team's speaker diarization systems for DIHARD II, highlighting system components, lessons learned, and performance improvements over baselines in challenging real-world scenarios.

Contribution

The paper presents a multi-component diarization system that considerably outperformed the challenge baselines, and reports lessons learned from the numerous single- and multi-channel approaches the team tried.

Findings

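01. The system considerably outperformed the challenge baselines.

02. The effect of each component (domain categorization, speech enhancement, speech activity detection, speaker embeddings, clustering, resegmentation, and system fusion) on overall diarization performance is analyzed.

03. Lessons learned from the single- and multi-channel approaches tried can inform future diarization system design.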

Abstract

This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team. Besides describing the system, which considerably outperformed the challenge baselines, we also focus on the lessons learned from numerous approaches that we tried for single and multi-channel systems. We present several components of our diarization system, including categorization of domains, speech enhancement, speech activity detection, speaker embeddings, clustering methods, resegmentation, and system fusion. We analyze and discuss the effect of each such component on the overall diarization performance within the realistic settings of the challenge.

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
