An Integrated Framework for Two-pass Personalized Voice Trigger

Dexin Liao; Jing Li; Yiming Zhi; Song Li; Qingyang Hong; Lin Li

arXiv:2106.15950·eess.AS·July 1, 2021

TL;DR

This paper introduces the XMUSPEECH system for personalized voice trigger, which combines a keyword spotting sub-system with a speaker verification sub-system and reports significant improvements over the baseline system.

Contribution

It presents a joint system with TDSC-ResNet for wake-up word detection and a multi-task learning network for speaker verification, integrating phonetic and speaker information.

Findings

01 Significant performance improvements over baseline.

02 Effective multi-task learning with CTC loss.

03 Enhanced wake-up word detection accuracy.

Abstract

In this paper, we present the XMUSPEECH system for Task 1 of the 2020 Personalized Voice Trigger Challenge (PVTC2020). Task 1 is joint wake-up word detection with speaker verification on close-talking data. The whole system consists of a keyword spotting (KWS) sub-system and a speaker verification (SV) sub-system. For the KWS system, we applied a Temporal Depthwise Separable Convolution Residual Network (TDSC-ResNet) to improve the system's performance. For the SV system, we proposed a multi-task learning network in which the phonetic branch is trained with the character labels of the utterance and the speaker branch is trained with the speaker labels. The phonetic branch is optimized with connectionist temporal classification (CTC) loss and serves as an auxiliary module for the speaker branch. Experiments show that our system achieves significant improvements over the baseline system.
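To make the KWS-plus-SV composition concrete, the following is a minimal sketch of how a two-pass decision could be wired together. It is not the authors' implementation: the function names, the threshold values, and the use of cosine similarity between speaker embeddings are all assumptions made for illustration, since the abstract does not specify the scoring back-end.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def personalized_trigger(kws_score, test_embedding, enrolled_embedding,
                         kws_threshold=0.5, sv_threshold=0.6):
    """Hypothetical two-pass decision.

    Pass 1: reject unless the keyword spotting score clears kws_threshold.
    Pass 2: accept only if the test speaker embedding is close enough to
            the enrolled speaker embedding (cosine scoring is an assumption).
    Both thresholds are illustrative values, not taken from the paper.
    """
    if kws_score < kws_threshold:
        return False  # wake-up word not detected, skip speaker verification
    return cosine_similarity(test_embedding, enrolled_embedding) >= sv_threshold


# Usage: the keyword is detected and the speaker matches the enrolled one.
print(personalized_trigger(0.9, [1.0, 0.0], [1.0, 0.0]))
```

Running speaker verification only after the keyword pass succeeds keeps the more expensive second stage off the common path where no wake-up word is present.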

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution · Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution
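
The methods listed above fit together as follows: a depthwise separable convolution applies one filter per channel along time (depthwise) and then mixes channels with a 1x1 convolution (pointwise). The sketch below is a plain-Python illustration of that factorization for 1-D (temporal) inputs, with no padding, stride 1, and no bias. It is not the TDSC-ResNet architecture itself; all function names and the example sizes are hypothetical.

```python
def depthwise_conv1d(x, dw_kernels):
    """x: list of C channels, each a list of T samples.
    dw_kernels: list of C kernels of length K, one filter per channel.
    Returns C channels of length T - K + 1 (no padding, stride 1)."""
    out = []
    for channel, kernel in zip(x, dw_kernels):
        k = len(kernel)
        out.append([
            sum(channel[t + i] * kernel[i] for i in range(k))
            for t in range(len(channel) - k + 1)
        ])
    return out


def pointwise_conv1d(x, pw_weights):
    """pw_weights: list of C_out rows, each holding C_in weights (a 1x1 convolution)."""
    steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
        for row in pw_weights
    ]


def depthwise_separable_conv1d(x, dw_kernels, pw_weights):
    """Depthwise filtering along time followed by pointwise channel mixing."""
    return pointwise_conv1d(depthwise_conv1d(x, dw_kernels), pw_weights)


def standard_params(c_in, c_out, k):
    """Weight count of an ordinary 1-D convolution (bias ignored)."""
    return c_in * c_out * k


def separable_params(c_in, c_out, k):
    """Weight count of the depthwise + pointwise factorization (bias ignored)."""
    return c_in * k + c_in * c_out


# Illustrative sizes only: 64 input channels, 64 output channels, kernel length 9.
print(standard_params(64, 64, 9), separable_params(64, 64, 9))
```

The parameter counts show why the factorization is attractive for small-footprint keyword spotting: the kernel length multiplies only the per-channel depthwise filters rather than the full channel-by-channel weight matrix.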