The NPU System for the 2020 Personalized Voice Trigger Challenge
Jingyong Hou, Li Zhang, Yihui Fu, Qing Wang, Zhanheng Yang, Qijie Shao, Lei Xie

TL;DR
This paper presents the NPU team's system for the 2020 personalized voice trigger challenge, combining a multi-scale dilated temporal convolutional keyword spotting system with a speaker verification system for accurate wake-up word detection and speaker identification.
Contribution
The paper introduces a novel multi-scale dilated temporal convolutional network for keyword spotting and integrates it with a speaker verification system for personalized voice trigger detection.
Findings
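Achieved a detection cost of 0.081 on the close-talking task of the evaluation dataset.
Achieved a detection cost of 0.091 on the far-field task of the evaluation dataset.
The combined system detects the wake-up word with the KWS subsystem and then verifies the speaker's identity on the triggered segment with the SV subsystem.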
Abstract
This paper describes the system developed by the NPU team for the 2020 personalized voice trigger challenge. Our submitted system consists of two independently trained subsystems: a small-footprint keyword spotting (KWS) system and a speaker verification (SV) system. For the KWS system, a multi-scale dilated temporal convolutional (MDTC) network is proposed to detect the wake-up word (WuW). For SV system, Write something here. The KWS system predicts the posterior probability that an audio utterance contains the WuW and estimates the location of the WuW at the same time. When the posterior probability of the WuW reaches a predefined threshold, the identity of the triggered segment is determined by the SV system. On the evaluation dataset, our submitted system obtains detection costs of 0.081 and 0.091 on the close-talking and far-field tasks, respectively.
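The two-stage decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `TriggerDecision` type, the default thresholds, and the SV scoring callback are all hypothetical, and the SV stage is reduced to a stub because the abstract does not describe it. The sketch only shows the control flow: the SV stage runs solely once the KWS posterior reaches its threshold.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class TriggerDecision:
    triggered: bool        # did the KWS posterior reach its threshold?
    accepted: bool         # did the SV stage also accept the speaker?
    frame: Optional[int]   # first frame index at which the KWS fired


def cascade_decision(
    kws_posteriors: Sequence[float],
    sv_score_fn: Callable[[int], float],
    kws_threshold: float = 0.5,   # assumed value, not from the paper
    sv_threshold: float = 0.5,    # assumed value, not from the paper
) -> TriggerDecision:
    """Run the SV stage only after the KWS stage has triggered.

    kws_posteriors: per-frame WuW posteriors from a KWS model (hypothetical input).
    sv_score_fn: maps the trigger frame to a speaker similarity score; a
        stand-in for an SV system scoring the triggered segment.
    """
    for t, posterior in enumerate(kws_posteriors):
        if posterior >= kws_threshold:
            # KWS fired: pass the triggered segment to the SV stage.
            sv_score = sv_score_fn(t)
            return TriggerDecision(True, sv_score >= sv_threshold, t)
    # KWS never fired, so the SV stage is never invoked.
    return TriggerDecision(False, False, None)
```

For example, `cascade_decision([0.1, 0.2, 0.9], lambda t: 0.8)` triggers at frame 2 and accepts the speaker, while the same posteriors with an SV score of 0.3 trigger but reject. Keeping the SV call inside the KWS branch means the more expensive verification model runs only on candidate wake-up segments.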
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
