Adaptive Speaker Embedding Self-Augmentation for Personal Voice Activity Detection with Short Enrollment Speech
Fuyuan Feng, Wenbin Zhang, Yu Gao, Longting Xu, Xiaofeng Mou, Yi Xu

TL;DR
This paper introduces an adaptive self-augmentation method for speaker embeddings in personal voice activity detection (PVAD). It improves detection when enrollment speech is short by fusing keyframe embeddings into the enrollment embedding and refining the result iteratively.
Contribution
The paper proposes an adaptive speaker embedding self-augmentation strategy, together with a long-term iterative refinement scheme, to improve PVAD accuracy when enrollment data is limited.
Findings
Recall, precision, and F1-score improve significantly under short enrollment conditions.
After five iterative updates, performance matches that obtained with full-length enrollment.
The source code is released for reproducibility.
Abstract
Personal Voice Activity Detection (PVAD) is crucial for identifying the target speaker's segments in mixed speech, yet its performance depends heavily on the quality of speaker embeddings. A key practical limitation is short enrollment speech, such as a wake-up word, which provides limited speaker cues. This paper proposes a novel adaptive speaker embedding self-augmentation strategy that enhances PVAD performance by augmenting the original enrollment embeddings through additive fusion of keyframe embeddings extracted from mixed speech. Furthermore, we introduce a long-term adaptation strategy that iteratively refines the embeddings during detection, mitigating the speaker's temporal variability. Experiments show significant gains in recall, precision, and F1-score under short enrollment conditions, matching full-length enrollment performance after five iterative updates. The source code is available at…
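The abstract gives no equations, so the following is only a minimal sketch of what additive fusion with iterative refinement could look like, not the paper's implementation. Everything named here is an assumption made for illustration: keyframes are taken to be frames whose target-speaker score exceeds a threshold, the fusion weight alpha and the threshold are hypothetical parameters, and cosine_score is a toy stand-in for the PVAD model's per-frame output.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_score(emb, frame):
    """Toy stand-in for a PVAD score: cosine similarity mapped to [0, 1]."""
    a, b = l2_normalize(emb), l2_normalize(frame)
    return (sum(x * y for x, y in zip(a, b)) + 1.0) / 2.0


def select_keyframes(frames, scores, threshold):
    """Assumed keyframe rule: keep frames scored at or above the threshold."""
    return [f for f, s in zip(frames, scores) if s >= threshold]


def augment_embedding(enroll, keyframes, alpha=0.5):
    """Additive fusion: enrollment embedding plus a weighted keyframe centroid."""
    base = l2_normalize(enroll)
    if not keyframes:
        return base  # nothing confident to fuse; keep the current embedding
    centroid = l2_normalize(mean_vector(keyframes))
    return l2_normalize([e + alpha * k for e, k in zip(base, centroid)])


def refine(enroll, chunks, score_fn, num_updates=5, alpha=0.5, threshold=0.7):
    """Iteratively update the embedding, one update per chunk of mixed speech."""
    emb = l2_normalize(enroll)
    for frames in chunks[:num_updates]:
        scores = [score_fn(emb, f) for f in frames]
        emb = augment_embedding(emb, select_keyframes(frames, scores, threshold), alpha)
    return emb


if __name__ == "__main__":
    short_enroll = [1.0, 0.0]            # hypothetical weak enrollment embedding
    target, interferer = [0.6, 0.8], [-1.0, 0.0]
    chunks = [[target, interferer]] * 5  # five chunks of "mixed speech" frames
    print(refine(short_enroll, chunks, cosine_score))
```

In this toy setup the refined embedding drifts toward the target speaker's frames and ignores the interferer, because only frames scored above the threshold contribute to the fusion. A real system would obtain the frame embeddings and scores from trained speaker-encoder and PVAD networks.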
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
