Adaptive Speaker Embedding Self-Augmentation for Personal Voice Activity Detection with Short Enrollment Speech

Fuyuan Feng; Wenbin Zhang; Yu Gao; Longting Xu; Xiaofeng Mou; Yi Xu

arXiv:2601.12769·eess.AS·January 21, 2026

Open Access

TL;DR

This paper introduces an adaptive self-augmentation method for speaker embeddings in personal voice activity detection (PVAD). It improves performance with short enrollment speech by fusing keyframe embeddings from the mixed speech into the enrollment embedding and refining the result iteratively during detection.

Contribution

The paper proposes an adaptive speaker embedding self-augmentation strategy and a long-term iterative refinement strategy to improve PVAD accuracy with limited enrollment data.

Findings

01 Significant improvements in recall, precision, and F1-score under short enrollment conditions.

02 Matching full-length enrollment performance after five iterative updates.

03 Source code availability for reproducibility.

Abstract

Personal Voice Activity Detection (PVAD) is crucial for identifying target speaker segments in mixed speech, yet its performance depends heavily on the quality of the speaker embeddings. A key practical limitation is that enrollment speech is often short, such as a wake-up word, and therefore provides limited cues. This paper proposes a novel adaptive speaker embedding self-augmentation strategy that enhances PVAD performance by augmenting the original enrollment embeddings through additive fusion of keyframe embeddings extracted from mixed speech. Furthermore, we introduce a long-term adaptation strategy that iteratively refines the embeddings during detection, mitigating the speaker's temporal variability. Experiments show significant gains in recall, precision, and F1-score under short enrollment conditions, matching full-length enrollment performance after five iterative updates. The source code is available at…
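
The additive fusion and iterative refinement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how keyframes are chosen or how the fusion is weighted, so the cosine-similarity threshold, the weight `alpha`, the L2 normalization, and all function names below are assumptions made purely for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def select_keyframes(enroll, frames, threshold=0.7):
    """ASSUMPTION: a keyframe is any frame-level embedding whose cosine
    similarity to the current enrollment embedding reaches `threshold`."""
    return [f for f in frames if cosine(enroll, f) >= threshold]


def augment(enroll, frames, alpha=0.5, threshold=0.7):
    """Additive fusion: enrollment embedding plus alpha times the mean keyframe.
    ASSUMPTION: `alpha` and the final renormalization are illustrative choices."""
    keyframes = select_keyframes(enroll, frames, threshold)
    if not keyframes:
        # Nothing looks like the target speaker, so keep the embedding as is.
        return l2_normalize(enroll)
    dim = len(enroll)
    mean_kf = [sum(f[i] for f in keyframes) / len(keyframes) for i in range(dim)]
    fused = [
        e + alpha * m
        for e, m in zip(l2_normalize(enroll), l2_normalize(mean_kf))
    ]
    return l2_normalize(fused)


def refine(enroll, chunks, alpha=0.5, threshold=0.7):
    """Long-term adaptation: apply the augmentation once per incoming chunk."""
    current = l2_normalize(enroll)
    for frames in chunks:
        current = augment(current, frames, alpha, threshold)
    return current


# Toy 2-D example: two frames resemble the enrolled speaker, one does not.
enroll = [1.0, 0.0]
frames = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
augmented = augment(enroll, frames)
refined = refine(enroll, [frames] * 5)  # five iterative updates
```

In this toy setup the fused embedding moves toward the mean of the selected keyframes while staying unit length, and an utterance with no matching frames leaves the enrollment embedding unchanged.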

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition