Joint Speech Activity and Overlap Detection with Multi-Exit Architecture
Ziqing Du, Kai Liu, Xucheng Wan, Huan Zhou

TL;DR
This paper introduces a multi-exit neural network architecture for joint speech activity and overlap detection, achieving state-of-the-art results and offering efficient deployment options.
Contribution
It proposes a multi-exit architecture for joint VAD and OSD, trained with schemes such as knowledge distillation and dense connections to improve performance.
Findings
Outperforms existing models on the AMI and DIHARD-III datasets.
Achieves F1 scores of 0.792 on AMI and 0.625 on DIHARD-III.
Offers a flexible system for quality and complexity trade-offs.
Abstract
Overlapped speech detection (OSD) is critical for speech applications in multi-party conversation scenarios. Despite considerable research effort and progress, OSD remains an open challenge compared with speech activity detection (VAD), and its overall performance is far from satisfactory. Most prior work formulates OSD as a standard classification problem, labeling speech at the frame level with either binary labels (OSD) or three-class labels (joint VAD and OSD). In contrast to the mainstream, this study investigates the joint VAD and OSD task from a new perspective. In particular, we propose to extend the traditional classification network with a multi-exit architecture. Such an architecture gives our system the unique capability to identify classes using either low-level features from early exits or high-level features from the last exit. In addition, two training…
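To make the abstract's two ideas concrete, the following is a minimal sketch, not the authors' implementation. It shows a three-class frame label (non-speech, single speaker, overlap) and an inference loop that stops at the first exit whose prediction is confident enough. The confidence-threshold rule, the `threshold` value, and the names `frame_label` and `predict_with_early_exit` are all assumptions made for illustration; the abstract only states that classes can be identified from either early exits or the last exit.

```python
import math
from typing import Iterable, List, Sequence, Tuple

# Three-class frame labels for joint VAD and OSD (label names are illustrative).
LABELS = ("non-speech", "single-speaker", "overlap")


def frame_label(num_active_speakers: int) -> int:
    """Map the number of active speakers in a frame to a class index 0, 1, or 2."""
    return min(max(num_active_speakers, 0), 2)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one frame's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_early_exit(
    exit_logits: Iterable[Sequence[float]],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> Tuple[int, int]:
    """Return (class index, exit index) for one frame.

    `exit_logits` yields logits from the shallowest exit to the deepest.
    Passing a generator means deeper exits are only computed when the
    shallower ones are not confident enough (an assumed exit criterion).
    """
    result = None
    for exit_index, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        result = (probs.index(confidence), exit_index)
        if confidence >= threshold:
            break  # confident enough: skip the remaining, deeper exits
    if result is None:
        raise ValueError("at least one exit is required")
    return result
```

For example, `predict_with_early_exit([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]])` stops at the first exit, while a frame whose first-exit logits are nearly flat falls through to the last exit. Raising or lowering `threshold` is one hypothetical way to trade detection quality against computation.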
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Knowledge Distillation
