Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains
Martin Lebourdais (LIUM), Théo Mariotte (LIUM, LAUM), Marie Tahon (LIUM), Anthony Larcher (LIUM), Antoine Laurent (LIUM), Silvio Montresor (LAUM), Sylvain Meignier (LIUM), Jean-Hugh Thomas (LAUM)

TL;DR
This paper introduces a comprehensive benchmark for joint voice activity and overlapped speech detection across various audio setups and speech domains, demonstrating that combined models can match dedicated systems' performance while reducing training costs.
Contribution
It presents a new benchmark for VAD and OSD models across multiple audio and speech domains, and proposes a joint model that outperforms state-of-the-art results with reduced training effort.
Findings
Joint models achieve similar F1-scores to separate systems.
Proposed architecture works for both single and multi-channel audio.
The 2/3-class systems outperform state-of-the-art results across the audio setups tested.
Abstract
Voice activity and overlapped speech detection (respectively VAD and OSD) are key pre-processing tasks for speaker diarization. The final segmentation performance highly relies on the robustness of these sub-tasks. Recent studies have shown VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, lacking information about the generalization capacities of the systems. This paper proposes a complete and new benchmark of different VAD and OSD models, on multiple audio setups (single/multi-channel) and speech domains (e.g. media, meeting...). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that the joint training of these two tasks offers similar performances in terms of F1-score to two…
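The joint formulation described in the abstract can be illustrated with a minimal sketch. It assumes a 3-class frame labeling (0 = non-speech, 1 = single speaker, 2 = overlap) derived from per-frame active-speaker counts, and a frame-level F1-score. The class indices, function names, and frame-level scoring are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# Assumed class indices for the 3-class joint VAD+OSD formulation.
NON_SPEECH, SINGLE_SPEAKER, OVERLAP = 0, 1, 2


def speaker_count_to_class(counts: Sequence[int]) -> List[int]:
    """Map per-frame active-speaker counts (>= 0) to 3-class labels.

    Any frame with two or more active speakers is labeled OVERLAP.
    """
    return [min(c, OVERLAP) for c in counts]


def split_decisions(classes: Sequence[int]) -> Tuple[List[bool], List[bool]]:
    """Recover binary VAD and OSD decisions from 3-class labels."""
    vad = [c != NON_SPEECH for c in classes]  # speech = single speaker or overlap
    osd = [c == OVERLAP for c in classes]     # overlap only
    return vad, osd


def f1_score(reference: Sequence[bool], hypothesis: Sequence[bool]) -> float:
    """Frame-level F1-score of the positive class."""
    tp = sum(1 for r, h in zip(reference, hypothesis) if r and h)
    fp = sum(1 for r, h in zip(reference, hypothesis) if not r and h)
    fn = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    counts = [0, 1, 2, 3, 1, 0]  # hypothetical speaker counts per frame
    classes = speaker_count_to_class(counts)
    vad_ref, osd_ref = split_decisions(classes)
    osd_hyp = [False, False, True, False, False, False]  # hypothetical prediction
    print(classes, vad_ref, osd_ref, f1_score(osd_ref, osd_hyp))
```

Under this labeling, a single 3-class model yields both decisions at once: VAD is "class is not 0" and OSD is "class is 2", which is why one training run can stand in for two dedicated systems.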
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
