End-to-end Models with auditory attention in Multi-channel Keyword Spotting

Haitong Zhang; Junbo Zhang; Yujun Wang

arXiv:1811.00350 · cs.SD · November 6, 2018 · 1 citation

Open Access

TL;DR

This paper introduces an attention-based end-to-end model for multi-channel keyword spotting that outperforms a baseline built on signal pre-processing, on both clean and noisy test data. Transfer learning and multi-target spectral mapping further improve its robustness to noise.

Contribution

The paper presents a novel attention-based end-to-end model for multi-channel keyword spotting that improves robustness and performance using transfer learning and multi-target spectral mapping.

Findings

01 Outperforms the baseline on both clean and noisy test data

02 Transfer learning improves robustness in noisy environments

03 Gains an absolute 30% in wake-up rate at 0.1 FA/hour on noisy data

Abstract

In this paper, we propose an attention-based end-to-end model for multi-channel keyword spotting (KWS), which is trained to optimize the KWS result directly. As a result, our model outperforms the baseline model with signal pre-processing techniques on both the clean and noisy testing data. We also found that multi-task learning results in better performance when the training and testing data are similar. Transfer learning and multi-target spectral mapping can dramatically enhance robustness to noisy environments. At 0.1 false alarms (FA) per hour, the model with transfer learning and multi-target mapping gains an absolute 30% improvement in the wake-up rate on the noisy data with an SNR of about -20.
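The abstract reports wake-up rate at a fixed operating point of 0.1 false alarms per hour. As a rough illustration of how such a number can be computed, here is a minimal sketch. It is not the authors' evaluation code. The function name `wake_up_rate_at_fa`, the one-score-per-utterance setup, and the rule that a detection fires when the score is strictly above the threshold are all assumptions made for this example.

```python
import numpy as np

def wake_up_rate_at_fa(pos_scores, neg_scores, neg_hours, fa_per_hour=0.1):
    """Hypothetical helper: wake-up rate at a fixed false-alarm budget.

    pos_scores: detector scores on utterances that contain the keyword.
    neg_scores: detector scores on audio without the keyword.
    neg_hours:  total duration of the negative audio, in hours.
    """
    pos = np.asarray(pos_scores, dtype=float)
    # Sort negative scores from highest to lowest.
    neg = np.sort(np.asarray(neg_scores, dtype=float))[::-1]

    # Number of false alarms the budget tolerates over the negative set.
    # The small epsilon guards against floating-point round-down.
    allowed = int(np.floor(fa_per_hour * neg_hours + 1e-9))

    if allowed >= len(neg):
        # Every negative may fire without exceeding the budget.
        threshold = float("-inf")
    else:
        # Lowest threshold at which at most `allowed` negatives score above it.
        threshold = float(neg[allowed])

    # Assumption: a detection fires when score > threshold (strictly).
    rate = float(np.mean(pos > threshold))
    return rate, threshold

# Toy data, invented for illustration only.
rate, threshold = wake_up_rate_at_fa(
    pos_scores=[0.9, 0.8, 0.4, 0.2],
    neg_scores=[0.7, 0.5, 0.3, 0.1],
    neg_hours=10,
)
print(rate, threshold)
```

With 10 hours of negative audio, the budget allows one false alarm, so the threshold sits at the second-highest negative score and two of the four keyword utterances are detected. Comparing two models at the same FA/hour budget, as the abstract does, means each model gets its own threshold chosen this way.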

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Speech Recognition and Synthesis