Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Shaojin Ding; Rajeev Rikhye; Qiao Liang; Yanzhang He; Quan Wang; Arun Narayanan; Tom O'Malley; Ian McGraw

arXiv:2204.03793·eess.AS·June 28, 2022·1 cites

TL;DR

Personal VAD 2.0 is a personalized voice activity detector for streaming on-device speech recognition. It targets three production challenges: quality in both enrollment and enrollment-less scenarios, streaming operation, and a limited latency and CPU/memory budget. The paper reports state-of-the-art results.

Contribution

The paper presents novel speaker embedding modulation, a training paradigm for enrollment-less scenarios, and architecture optimizations for resource-limited on-device VAD.

Findings

01 Achieves state-of-the-art performance in personalized VAD

02 Operates effectively in streaming and enrollment-less scenarios

03 Optimized for low latency and limited resources

Abstract

Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, there are still several critical challenges to address before this model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model size should be small enough to fit a limited latency and CPU/Memory budget. To meet the multi-faceted requirements, we propose a series of novel…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing