Transformer-based encoder-encoder architecture for Spoken Term Detection
Jan Švec, Luboš Šmídl, Jan Lehečka

TL;DR
This paper introduces a Transformer-based encoder-encoder architecture for spoken term detection that leverages shared embeddings and outperforms LSTM-based baselines on English and Czech datasets.
Contribution
The novel encoder-encoder architecture with convolutional, upsampling, and attention masking modifications improves spoken term detection performance.
Findings
Outperforms LSTM-based baseline methods
Effective on English and Czech datasets
Utilizes shared embedding space for scoring
Abstract
The paper presents a method for spoken term detection based on the Transformer architecture. We propose the encoder-encoder architecture employing two BERT-like encoders with additional modifications, including convolutional and upsampling layers, attention masking, and shared parameters. The encoders project a recognized hypothesis and a searched term into a shared embedding space, where the score of the putative hit is computed using the calibrated dot product. In the experiments, we used the Wav2Vec 2.0 speech recognizer, and the proposed system outperformed a baseline method based on deep LSTMs on the English and Czech STD datasets based on USC Shoah Foundation Visual History Archive (MALACH).
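The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the calibration, so the code below assumes one common choice: a learned scale `alpha` and bias `beta` applied to the dot product, followed by a sigmoid. The function name `calibrated_dot_scores`, the parameter names, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def calibrated_dot_scores(hyp_emb, term_emb, alpha=1.0, beta=0.0):
    """Score every (hypothesis position, term position) pair.

    hyp_emb:  list of T embedding vectors from the hypothesis encoder
    term_emb: list of M embedding vectors from the term encoder
    alpha, beta: calibration scale and bias (assumed form, not from the paper)

    Returns a T x M nested list of scores in (0, 1).
    """
    scores = []
    for h in hyp_emb:
        row = []
        for q in term_emb:
            if len(h) != len(q):
                raise ValueError("both encoders must share one embedding dimension")
            # Dot product in the shared embedding space, then calibration.
            logit = alpha * sum(a * b for a, b in zip(h, q)) + beta
            # Sigmoid maps the calibrated value to a score between 0 and 1.
            row.append(1.0 / (1.0 + math.exp(-logit)))
        scores.append(row)
    return scores


# Toy example: three hypothesis positions, one term position, dimension 2.
hyp = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
term = [[1.0, 0.0]]
scores = calibrated_dot_scores(hyp, term, alpha=4.0, beta=-2.0)
for position, row in enumerate(scores):
    print(position, [round(s, 3) for s in row])
```

In this toy input only the first hypothesis position aligns with the term embedding, so it receives the highest score; a real system would then threshold or aggregate such scores to decide on putative hits.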
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: Multi-Head Attention · Attention Is All You Need · Spatial-Channel Token Distillation · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings
