TL;DR
This paper introduces a method to train attention-based sequence-to-sequence speech recognition models to directly minimize expected word error rate (WER), rather than the usual cross-entropy criterion, improving WER by up to 8.2% relative.
Contribution
It proposes a novel training approach that optimizes expected word error rate using N-best list approximations, matching state-of-the-art discriminative systems.
Findings
Achieves up to an 8.2% relative WER reduction over the baseline.
Matches the performance of a traditional discriminative sequence-trained system on a voice-search task.
Provides an effective training method for grapheme-based attention models.
Abstract
Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we…
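The two approximations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`word_errors`, `expected_errors_nbest`, `expected_errors_sampled`), the whitespace tokenization, and the choice to renormalize the hypothesis probabilities over the N-best list are all assumptions made for illustration. The sketch only computes loss values; a real training setup would compute the same quantities from differentiable model log-probabilities so that gradients can flow.

```python
import math


def word_errors(hyp, ref):
    """Word-level edit distance between two whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))  # distances from an empty hypothesis
    for i, hw in enumerate(h, start=1):
        curr = [i]
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1,          # skip a hypothesis word
                            curr[j - 1] + 1,      # skip a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def expected_errors_nbest(nbest, ref):
    """Expected word errors over an N-best list.

    nbest: list of (hypothesis, log_prob) pairs (hypothetical format).
    Assumption: probabilities are renormalized over the list, so the
    listed hypotheses are treated as carrying all of the probability mass.
    """
    max_lp = max(lp for _, lp in nbest)  # subtract the max for numerical stability
    weights = [math.exp(lp - max_lp) for _, lp in nbest]
    total = sum(weights)
    return sum(w / total * word_errors(hyp, ref)
               for (hyp, _), w in zip(nbest, weights))


def expected_errors_sampled(samples, ref):
    """Monte Carlo estimate: mean word errors over hypotheses sampled from the model."""
    return sum(word_errors(hyp, ref) for hyp in samples) / len(samples)


if __name__ == "__main__":
    ref = "play a song"
    nbest = [("play a song", math.log(0.2)),
             ("play the song", math.log(0.1)),
             ("play song", math.log(0.1))]
    print(expected_errors_nbest(nbest, ref))
    print(expected_errors_sampled(["play a song", "play the song"], ref))
```

In this toy example the top hypothesis is correct and holds half of the renormalized mass, while the other two hypotheses each contain one word error, so the N-best estimate comes out at roughly 0.5 expected word errors. The sampling-based estimate replaces the weighted sum with an unweighted average over drawn hypotheses.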
