Improving Voice Trigger Detection with Metric Learning

Prateeth Nayak; Takuya Higuchi; Anmol Gupta; Shivesh Ranjan; Stephen Shum; Siddharth Sigtia; Erik Marchi; Varun Lakshminarasimhan; Minsik Cho; Saurabh Adya; Chandra Dhir; Ahmed Tewfik

arXiv:2204.02455 · cs.SD · September 15, 2022

TL;DR

This paper introduces a voice trigger detector that is personalized with a small number of utterances from the target speaker, reducing false rejections, especially for underrepresented groups such as accented speakers.

Contribution

The proposed encoder-decoder model personalizes voice trigger detection: the encoder performs speaker-independent detection, while the decoder predicts a personalized embedding for each utterance, improving accuracy over conventional speaker-independent detectors.

Findings

01 Achieves 38% relative reduction in false rejection rate

02 Effective personalization with minimal target speaker data

03 Improves detection accuracy for accented and underrepresented speakers

Abstract

Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score…
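
The similarity-based scoring mentioned in the abstract can be illustrated with a minimal sketch. The abstract excerpt above is cut off before it specifies how the similarity is computed, so everything here is an assumption made for illustration: cosine similarity as the metric, an averaged enrollment profile, and the hypothetical names `cosine_similarity`, `mean_embedding`, and `personalized_score`. The toy vectors stand in for decoder outputs and are not data from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Element-wise mean of a list of embeddings (assumed speaker profile)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def personalized_score(test_embedding, enrollment_embeddings):
    """Hypothetical personalized score: similarity to the averaged profile."""
    profile = mean_embedding(enrollment_embeddings)
    return cosine_similarity(test_embedding, profile)


# Toy embeddings standing in for decoder outputs (illustrative values only).
enrollment = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
same_speaker = [0.85, 0.15, 0.05]
other_speaker = [0.0, 0.2, 0.9]

score_same = personalized_score(same_speaker, enrollment)
score_other = personalized_score(other_speaker, enrollment)
print(score_same > score_other)
```

In this sketch an utterance whose embedding lies close to the enrollment profile scores higher than one that points in a different direction. How the paper combines such a score with the encoder's speaker-independent output is not stated in the excerpt, so the sketch leaves that out.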

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing