Transformer-based encoder-encoder architecture for Spoken Term Detection

Jan Švec; Luboš Šmídl; Jan Lehečka

arXiv:2211.01089·cs.CL·November 3, 2022

Open Access

TL;DR

This paper introduces a Transformer-based encoder-encoder architecture for spoken term detection. Two BERT-like encoders project the recognized hypothesis and the searched term into a shared embedding space, and the system outperforms a deep-LSTM baseline on English and Czech datasets.

Contribution

An encoder-encoder architecture built from two BERT-like encoders, modified with convolutional and upsampling layers, attention masking, and shared parameters, which improves spoken term detection over a deep-LSTM baseline.

Findings

01 Outperforms LSTM-based baseline methods

02 Effective on English and Czech datasets

03 Utilizes shared embedding space for scoring

Abstract

The paper presents a method for spoken term detection (STD) based on the Transformer architecture. We propose an encoder-encoder architecture employing two BERT-like encoders with additional modifications, including convolutional and upsampling layers, attention masking, and shared parameters. The encoders project a recognized hypothesis and a searched term into a shared embedding space, where the score of the putative hit is computed using the calibrated dot product. In the experiments, we used the Wav2Vec 2.0 speech recognizer, and the proposed system outperformed a baseline method based on deep LSTMs on the English and Czech STD datasets based on the USC Shoah Foundation Visual History Archive (MALACH).
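
The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the calibration, so this example assumes a common form: a scalar scale and bias applied to the dot product, followed by a sigmoid. The function name `calibrated_dot_scores`, the parameters `scale` and `bias`, and the toy vectors are hypothetical and are not taken from the paper. The two encoders themselves are not modeled here; their outputs are stood in for by plain lists of floats.

```python
import math
from typing import List, Sequence


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def calibrated_dot_scores(
    hyp_embeddings: Sequence[Sequence[float]],
    term_embedding: Sequence[float],
    scale: float = 1.0,
    bias: float = 0.0,
) -> List[float]:
    """Score each hypothesis position against one searched term.

    ASSUMPTION: "calibrated dot product" is modeled here as
    sigmoid(scale * <h_t, q> + bias). The paper's abstract does not
    spell out the calibration, so treat this as an illustration only.
    """
    scores = []
    for frame in hyp_embeddings:
        if len(frame) != len(term_embedding):
            raise ValueError("embeddings must share one dimensionality")
        dot = sum(h * q for h, q in zip(frame, term_embedding))
        scores.append(_sigmoid(scale * dot + bias))
    return scores


# Toy stand-ins for encoder outputs (hypothetical values).
hypothesis = [[1.0, 0.0], [0.0, 1.0]]  # one vector per hypothesis position
term = [1.0, 0.0]                      # one vector for the searched term

for position, score in enumerate(
    calibrated_dot_scores(hypothesis, term, scale=2.0, bias=-1.0)
):
    print(position, round(score, 3))
```

Because both inputs live in the same embedding space, a position whose vector aligns with the term vector receives a higher score than an orthogonal one. A putative hit would then be declared wherever the score exceeds a chosen threshold.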

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Multi-Head Attention · Attention Is All You Need · Spatial-Channel Token Distillation · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings