Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

Vineet Garg; Ognjen Rudovic; Pranay Dighe; Ahmed H. Abdelaziz; Erik Marchi; Saurabh Adya; Chandra Dhir; Ahmed Tewfik

arXiv:2203.15975·eess.AS·March 31, 2022

TL;DR

This paper introduces a weakly-supervised device-directed speech detection method that uses knowledge distillation from an ASR model to improve false trigger mitigation, reporting a 66% reduction in equal-error-rate (EER) over the baseline.

Contribution

It proposes a novel approach combining weakly-labeled data sampling and knowledge distillation to enhance device-directed speech detection without extensive manual annotation.
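
To make the idea of regularizing a weakly-supervised model via distillation concrete, here is a minimal sketch, not the authors' implementation. It assumes a binary device-directed / not-directed task, a student that outputs one logit per utterance, and a teacher (standing in for the ASR-based model) that outputs a probability per utterance. The function names, the cross-entropy form of the distillation term, and the mixing weight `alpha` are all illustrative assumptions; the page does not specify the actual loss.

```python
import numpy as np

def sigmoid(x):
    """Map logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard (0/1) or soft (in [0, 1])."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return -np.mean(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))

def distillation_regularized_loss(student_logits, weak_labels, teacher_probs, alpha=0.5):
    """Hypothetical loss: weak-label term plus a distillation regularizer.

    alpha is an assumed mixing weight: 0 ignores the teacher,
    1 trains on the teacher's soft targets only.
    """
    student_probs = sigmoid(student_logits)
    hard_term = binary_cross_entropy(student_probs, weak_labels)    # noisy weak labels
    soft_term = binary_cross_entropy(student_probs, teacher_probs)  # teacher soft targets
    return (1.0 - alpha) * hard_term + alpha * soft_term

# Toy usage with made-up numbers (not data from the paper).
logits = np.array([2.0, -1.0, 0.5])
weak = np.array([1.0, 0.0, 0.0])
teacher = np.array([0.9, 0.2, 0.6])
loss = distillation_regularized_loss(logits, weak, teacher, alpha=0.5)
```

The intuition behind this kind of design is that the teacher's soft targets can temper the influence of individual mislabeled examples in the weakly-labeled set.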

Findings

1. 66% reduction in equal-error-rate (EER) over the baseline
2. An ensemble of models further improves accuracy by 20%
3. Effective use of noisy, weakly-labeled data for training
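
Since the headline result is stated in terms of EER, a short sketch of how that metric is commonly computed may help. This uses the standard definition (the operating point where the false-accept rate equals the false-reject rate) with a simple threshold sweep; the page does not describe the authors' evaluation code, so the function name and the tie-breaking choice are assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    scores: higher means "more likely device-directed".
    labels: 1 for device-directed, 0 for unintended (false trigger).
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two are closest (an assumed tie-breaking rule).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    best_gap, eer = np.inf, None
    for t in np.unique(scores):
        accept = scores >= t
        far = np.sum(accept & (labels == 0)) / n_neg   # false triggers accepted
        frr = np.sum(~accept & (labels == 1)) / n_pos  # directed speech rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy usage with made-up scores (not data from the paper).
separable = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A relative reduction such as the one reported above is then (baseline EER minus new EER) divided by baseline EER.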

Abstract

We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistant (VA) activation due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a target keyword, inferring user intent in the absence of a keyword is difficult. This also poses a challenge when creating the training/evaluation data for such systems due to inherent ambiguity in the user's data. To this end, we propose a novel FTM approach that uses weakly-labeled training data obtained with a newly introduced data sampling strategy. While this sampling strategy reduces data annotation efforts, the data labels are noisy as the data are not annotated manually. We use these data to…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Knowledge Distillation · Balanced Selection