Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances

Phillip Keung; Wei Niu; Yichao Lu; Julian Salazar; Vikas Bhardwaj

arXiv:2002.05150 · eess.AS · February 13, 2020 · 5 cites

TL;DR

This paper investigates how attentional speech recognition models generate excessively long, repetitive outputs on out-of-domain utterances. The results suggest that the problem is intrinsic to the attention mechanism, and the paper proposes a length prediction model to mitigate it.

Contribution

It identifies the problem of excessively long outputs in attentional ASR models on out-of-domain data and introduces a length prediction model to improve decoding robustness.

Findings

01

Attentional models produce overly long, repetitive transcripts on out-of-domain utterances.

02

Hybrid DNN-HMM models do not exhibit this problem, indicating a specific issue with attention mechanisms.

03

A length prediction model effectively identifies and truncates problematic outputs, maintaining accuracy.

Abstract

We discuss the problem of echographic transcription in autoregressive sequence-to-sequence attentional architectures for automatic speech recognition, where a model produces very long sequences of repetitive outputs when presented with out-of-domain utterances. We decode audio from the British National Corpus with an attentional encoder-decoder model trained solely on the LibriSpeech corpus. We observe that there are many 5-second recordings that produce more than 500 characters of decoding output (i.e. more than 100 characters per second). A frame-synchronous hybrid (DNN-HMM) model trained on the same data does not produce these unusually long transcripts. These decoding issues are reproducible in a speech transformer model from ESPnet, and to a lesser extent in a self-attention CTC model, suggesting that these issues are intrinsic to the use of the attention mechanism. We create a…
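
The rate quoted in the abstract (more than 500 characters for a 5-second recording, i.e. more than 100 characters per second) can be turned into a simple sanity check on decoder output. The following is a minimal sketch only, not the length prediction model from the paper: the function names and the `max_rate` parameter are hypothetical, and the 100 characters-per-second threshold is borrowed from the abstract's example rather than tuned on data.

```python
def chars_per_second(transcript: str, duration_s: float) -> float:
    """Return the number of output characters per second of audio."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(transcript) / duration_s


def is_suspiciously_long(transcript: str, duration_s: float,
                         max_rate: float = 100.0) -> bool:
    """Flag a transcript whose character rate exceeds max_rate.

    max_rate is an illustrative threshold (assumption), taken from the
    abstract's "more than 100 characters per second" example.
    """
    return chars_per_second(transcript, duration_s) > max_rate


if __name__ == "__main__":
    # A 5-second clip decoded into a long, repetitive string.
    looping_output = "the cat and " * 50  # 600 characters
    print(is_suspiciously_long(looping_output, 5.0))
    # A 5-second clip decoded into a short, plausible string.
    print(is_suspiciously_long("hello world", 5.0))
```

A fixed threshold like this only flags outputs after decoding. The paper instead describes a length prediction model, whose details are not given in the text available here.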

Code & Models

Repositories

aws-samples/seq2seq-asr-misbehaves (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax