Neural Network based End-to-End Query by Example Spoken Term Detection
Dhananjay Ram, Lesly Miculicich, Hervé Bourlard

TL;DR
This paper introduces a neural network end-to-end framework for query by example spoken term detection, outperforming traditional DTW-based methods by jointly optimizing feature extraction and pattern matching.
Contribution
It presents the first fully neural network-based end-to-end system for QbE-STD, replacing separate feature extraction and matching stages with joint optimization.
Findings
Multilingual bottleneck features improve with more training languages.
CNN-based matching outperforms DTW-based matching with bottleneck features.
End-to-end training significantly improves detection performance.
Abstract
This paper focuses on the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual bottleneck features, and show that multilingual features perform increasingly better with more training languages. Previously, it has been shown that DTW based matching can be replaced with CNN based matching when using posterior features. Here, we show that CNN based matching outperforms DTW based matching when using bottleneck features as well. In this case, the feature extraction and pattern matching stages of our QbE-STD system are optimized independently of each other. We propose to integrate these two stages in a fully neural…
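To make the DTW baseline mentioned in the abstract concrete, the following is a minimal sketch of subsequence DTW matching between a spoken query and a longer test utterance, each given as a matrix of frame-level features (for example, bottleneck features). It is not the authors' implementation: the function names, the choice of cosine distance, and the normalization by query length are assumptions made for illustration only.

```python
import numpy as np


def cosine_distance_matrix(query, utterance):
    """Pairwise cosine distance between query frames (rows) and utterance frames (columns)."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    return 1.0 - q @ u.T


def subsequence_dtw_cost(query, utterance):
    """Lowest accumulated cost of aligning the whole query to any region of the utterance.

    Assumption: the cost is divided by the number of query frames, a simple
    stand-in for the path-length normalization used in practical systems.
    """
    dist = cosine_distance_matrix(query, utterance)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]  # the query may start at any utterance frame
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    # the query may end at any utterance frame
    return acc[-1, :].min() / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    utterance = rng.random((50, 8)) + 0.1  # 50 frames of hypothetical 8-dim features
    query = utterance[20:30]               # query taken from inside the utterance
    print(subsequence_dtw_cost(query, utterance))
```

A lower cost suggests the query is more likely to occur in the utterance; a detection decision would then come from thresholding this score. The CNN-based and end-to-end approaches described in the abstract replace this hand-designed matching step with a learned one.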
Taxonomy
Methods: Dynamic Time Warping
