Improving Voice Trigger Detection with Metric Learning
Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

TL;DR
This paper introduces a voice trigger detector that personalizes detection using a small number of utterances from the target speaker, reducing false rejections, especially for underrepresented groups such as accented speakers.
Contribution
The proposed encoder-decoder model enables personalized voice trigger detection by predicting speaker-specific embeddings, improving accuracy over traditional speaker-independent detectors.
Findings
Achieves a 38% relative reduction in false rejection rate
Effective personalization with minimal target speaker data
Improves detection accuracy for accented and underrepresented speakers
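To make "relative reduction" concrete, here is a minimal sketch of the arithmetic. The two false rejection rates are hypothetical values chosen only so that the result comes out to 38%; they are not figures reported in the paper, and the function name is likewise illustrative.

```python
def relative_reduction(baseline_rate, new_rate):
    """Return the fractional reduction of new_rate relative to baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical rates for illustration only (NOT taken from the paper).
baseline_frr = 0.050      # assumed speaker-independent false rejection rate
personalized_frr = 0.031  # assumed personalized false rejection rate

reduction = relative_reduction(baseline_frr, personalized_frr)
print(f"relative FRR reduction: {reduction:.0%}")
```

A relative reduction is measured against the baseline rate, so under these assumed numbers the absolute drop is 1.9 percentage points even though the relative drop is 38%.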
Abstract
Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score…
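The scoring idea in the abstract can be sketched as follows. The excerpt above does not specify the similarity function or how a speaker's enrollment utterances are combined, so this sketch assumes cosine similarity against the average of the enrollment embeddings. All function names and the toy 3-dimensional embeddings are hypothetical; in the paper the embeddings would come from the decoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Element-wise average of a non-empty list of embeddings (assumed pooling)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def personalized_score(test_embedding, enrollment_embeddings):
    """Score a test utterance against a speaker profile built from enrollment."""
    profile = mean_embedding(enrollment_embeddings)
    return cosine_similarity(test_embedding, profile)


# Toy embeddings standing in for decoder outputs (hypothetical values).
enrollment = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
matching = [0.9, 0.1, 0.0]      # close to the enrollment utterances
non_matching = [0.0, 0.0, 1.0]  # orthogonal to the enrollment utterances

score_match = personalized_score(matching, enrollment)
score_non = personalized_score(non_matching, enrollment)
print(score_match > score_non)
```

In a deployed detector, a score like this would be compared against a threshold, possibly together with the encoder's speaker-independent score; how the two are combined is not stated in the excerpt.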
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
