Enrollment-less training for personalized voice activity detection
Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

TL;DR
This paper introduces an enrollment-less training method for personalized voice activity detection that eliminates the need for enrollment data, reducing dataset preparation costs and enabling training with standard VAD datasets.
Contribution
The proposed enrollment-less training approach allows PVAD models to be trained without enrollment data by augmenting utterances to simulate variability, bridging the gap between training and inference.
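The pairing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `augment` and `make_training_example`, the choice of augmentations (random gain plus additive Gaussian noise), and their parameter values are all hypothetical, since the summary does not say which augmentations the paper uses. The sketch also assumes each utterance contains a single speaker, so every speech frame is treated as target-speaker speech.

```python
import random


def augment(wave, rng, max_gain_db=6.0, noise_std=0.01):
    """Return a perturbed copy of `wave`.

    Random gain and additive Gaussian noise are illustrative choices only;
    the source does not specify the augmentations.
    """
    gain = 10.0 ** (rng.uniform(-max_gain_db, max_gain_db) / 20.0)
    return [gain * sample + rng.gauss(0.0, noise_std) for sample in wave]


def make_training_example(wave, vad_labels, rng):
    """Build (enrollment, model_input, targets) from ONE utterance.

    wave:       list of float samples for a single utterance
    vad_labels: per-frame labels, 0 = non-speech, 1 = speech
    """
    # Two independently augmented views of the same utterance: one plays
    # the role of the enrollment speech, the other is the PVAD input.
    enrollment = augment(wave, rng)
    model_input = augment(wave, rng)
    # Assumption: single-speaker utterance, so speech frames are target speech.
    targets = list(vad_labels)
    return enrollment, model_input, targets


rng = random.Random(0)
wave = [0.1] * 160          # toy waveform
vad_labels = [0, 1, 1, 0]   # toy frame-level speech/non-speech labels
enrollment, model_input, targets = make_training_example(wave, vad_labels, rng)
```

Because both views come from the same recording, no separate enrollment utterance and no speaker label is needed. The independent augmentations are what keep the two views from being identical, which is how the sketch mimics the mismatch between enrollment and input speech at inference time.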
Findings
The method effectively trains PVAD without enrollment data.
Experimental results show improved detection accuracy.
The approach reduces data collection costs.
Abstract
We present a novel personalized voice activity detection (PVAD) learning method that does not require enrollment data during training. PVAD is a task to detect the speech segments of a specific target speaker at the frame level using enrollment speech of the target speaker. Since PVAD must learn speakers' speech variations to clarify the boundary between speakers, studies on PVAD used large-scale datasets that contain many utterances for each speaker. However, the datasets to train a PVAD model are often limited because substantial cost is needed to prepare such a dataset. In addition, we cannot utilize the datasets used to train the standard VAD because they often lack speaker labels. To solve these problems, our key idea is to use one utterance as both a kind of enrollment speech and an input to the PVAD during training, which enables PVAD training without enrollment speech. In our…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
