The DKU System Description for The Interspeech 2021 Auto-KWS Challenge

Yechen Wang; Yan Jia; Murong Ma; Zexin Cai; Ming Li

arXiv:2104.04993 · eess.AS · April 13, 2021

TL;DR

This paper describes a two-stage keyword spotting system that pairs subsequence dynamic time warping with verification based on acoustic word embeddings. It achieves an average score of 0.61 on the Auto-KWS 2021 Challenge feedback dataset, outperforming the baseline by 0.25.

Contribution

The paper introduces a novel two-stage keyword spotting approach that integrates template matching and acoustic word embeddings for enhanced detection performance.

Findings

01

Achieved an average score of 0.61 on the feedback dataset.

02

Outperformed the baseline system by 0.25 on the feedback dataset.

03

Demonstrated effectiveness of combining DTW and embedding-based verification.

Abstract

This paper introduces the system submitted by the DKU-SMIIP team for the Auto-KWS 2021 Challenge. Our implementation consists of a two-stage keyword spotting system based on query-by-example spoken term detection and a speaker verification system. The keyword spotting system employs two different detection algorithms. The first stage adopts subsequence dynamic time warping for template matching, based on frame-level language-independent bottleneck features and phoneme posterior probabilities. The second stage uses a sliding-window template matching algorithm based on acoustic word embeddings to further verify the detections from the first stage. As a result, our KWS system achieves an average score of 0.61 on the feedback dataset, which outperforms the baseline system by 0.25.
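To make the first-stage idea concrete, the following is a minimal sketch of subsequence dynamic time warping over generic frame-level feature matrices. It is not the authors' implementation: the function name `subsequence_dtw`, the cosine frame distance, the three-way step pattern, and the length normalisation are all assumptions chosen for illustration, and the inputs stand in for whatever bottleneck or posterior features a real system would extract.

```python
import numpy as np


def subsequence_dtw(query, utterance):
    """Find the best match of `query` inside `utterance`.

    Both inputs are (frames, dims) arrays of frame-level features
    (hypothetical stand-ins for bottleneck features or posteriors).
    Returns (normalised_cost, start_frame, end_frame).
    """
    # Cosine distance between every query frame and every utterance frame
    # (an assumed choice of local distance).
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    dist = 1.0 - q @ u.T  # shape (n_query, n_utt)

    n, m = dist.shape
    acc = np.full((n, m), np.inf)        # accumulated path cost
    start = np.zeros((n, m), dtype=int)  # start column of the path ending at (i, j)

    # Free start: the first query frame may align with any utterance frame.
    acc[0, :] = dist[0, :]
    start[0, :] = np.arange(m)

    for i in range(1, n):
        # First column can only be reached from the cell above.
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        start[i, 0] = start[i - 1, 0]
        for j in range(1, m):
            prev_cells = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            prev_costs = [acc[c] for c in prev_cells]
            k = int(np.argmin(prev_costs))
            acc[i, j] = prev_costs[k] + dist[i, j]
            start[i, j] = start[prev_cells[k]]

    # Free end: the match may finish at any utterance frame.
    end = int(np.argmin(acc[-1]))
    return float(acc[-1, end] / n), int(start[-1, end]), end
```

A detector built on this sketch would compare the returned cost against a threshold and pass the `[start_frame, end_frame]` segment on to a second-stage verifier; the threshold value and the hand-off are likewise left as assumptions here.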

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing