Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

Martin Lebourdais (LIUM); Théo Mariotte (LIUM, LAUM); Marie Tahon (LIUM); Anthony Larcher (LIUM); Antoine Laurent (LIUM); Silvio Montresor (LAUM); Sylvain Meignier (LIUM); Jean-Hugh Thomas (LAUM)

arXiv:2307.13012·cs.SD·July 26, 2023·1 cites

Open Access

TL;DR

This paper introduces a comprehensive benchmark for joint voice activity and overlapped speech detection across various audio setups and speech domains, demonstrating that combined models can match dedicated systems' performance while reducing training costs.

Contribution

The paper presents a new benchmark of VAD and OSD models across multiple audio setups (single- and multi-channel) and speech domains. Its 2/3-class systems outperform state-of-the-art results, and joint training of the two tasks reduces training cost.

Findings

01. Joint models achieve F1-scores similar to those of two dedicated VAD and OSD systems.

02. The proposed architecture works for both single-channel and multi-channel audio.

03. The 2/3-class systems outperform state-of-the-art results.

Abstract

Voice activity and overlapped speech detection (respectively VAD and OSD) are key pre-processing tasks for speaker diarization. The final segmentation performance highly relies on the robustness of these sub-tasks. Recent studies have shown VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, lacking information about the generalization capacities of the systems. This paper proposes a complete and new benchmark of different VAD and OSD models, on multiple audio setups (single/multi-channel) and speech domains (e.g. media, meeting...). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that the joint training of these two tasks offers similar performances in terms of F1-score to two…
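
The 3-class formulation mentioned in the abstract can be illustrated with a small sketch. It assumes a frame-level label convention of 0 = non-speech, 1 = single speaker, 2 = overlapped speech; this convention, the function names `split_tasks` and `f1_score`, and the toy label sequences are all hypothetical and are not taken from the paper. The sketch shows how one multi-class output could be turned into binary VAD and OSD decisions, each scored with a standard binary F1-score.

```python
from typing import List, Sequence, Tuple

# Assumed label convention (hypothetical, not confirmed by the paper).
NON_SPEECH, SINGLE_SPEAKER, OVERLAP = 0, 1, 2


def split_tasks(labels: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Derive binary VAD and OSD decisions from 3-class frame labels."""
    # VAD: any frame containing speech, overlapped or not.
    vad = [int(label != NON_SPEECH) for label in labels]
    # OSD: only frames where several speakers are active.
    osd = [int(label == OVERLAP) for label in labels]
    return vad, osd


def f1_score(reference: Sequence[int], hypothesis: Sequence[int]) -> float:
    """Binary F1-score computed over aligned frame decisions."""
    pairs = list(zip(reference, hypothesis))
    tp = sum(1 for r, h in pairs if r == 1 and h == 1)
    fp = sum(1 for r, h in pairs if r == 0 and h == 1)
    fn = sum(1 for r, h in pairs if r == 1 and h == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy frame labels, made up for illustration only.
reference = [0, 1, 1, 2, 2, 0]
hypothesis = [0, 1, 2, 2, 1, 0]

ref_vad, ref_osd = split_tasks(reference)
hyp_vad, hyp_osd = split_tasks(hypothesis)

vad_f1 = f1_score(ref_vad, hyp_vad)
osd_f1 = f1_score(ref_osd, hyp_osd)
print(f"VAD F1 = {vad_f1:.2f}, OSD F1 = {osd_f1:.2f}")
```

In this toy case the hypothesis confuses single-speaker and overlapped frames, so it loses OSD score while the VAD decisions stay unaffected. That is one reason the two tasks are usually reported separately even when a single model predicts both.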

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing