Deliberation Model Based Two-Pass End-to-End Speech Recognition

Ke Hu; Tara N. Sainath; Ruoming Pang; Rohit Prabhavalkar

arXiv:2003.07962 · eess.AS · March 19, 2020 · 1 citation

TL;DR

This paper introduces a deliberation-based two-pass end-to-end speech recognition model that attends to both acoustics and first-pass hypotheses, achieving a 12% relative WER reduction over LAS rescoring on Google Voice Search tasks.

Contribution

It proposes a deliberation network that attends to both acoustics and first-pass hypotheses, with the hypotheses encoded by a bidirectional encoder, improving recognition accuracy over LAS rescoring, which attends to acoustics only.

Findings

1. Achieves a 12% relative WER reduction over LAS rescoring on Google Voice Search tasks

2. Attains a 23% reduction on a proper noun test set

3. Performs 21% better, in relative terms, than a large conventional model
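
The findings above are stated as relative reductions, so a short sketch of the arithmetic may help. The function name and the WER values below are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Computed as 100 * (baseline - new) / baseline, so a drop from
    a baseline WER to a lower WER yields a positive percentage.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (not from the paper), used only to show the calculation.
example = relative_wer_reduction(baseline_wer=10.0, new_wer=8.8)
print(round(example, 1))
```

A relative reduction depends on the baseline: the same percentage corresponds to a smaller absolute WER change when the baseline WER is already low.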

Abstract

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set. Compared to a large conventional…
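
The idea of attending to both acoustics and first-pass hypotheses can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes single-head dot-product attention and assumes the two context vectors are simply concatenated before being passed to the decoder. All function and variable names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, memory: List[Vector]) -> Vector:
    """Dot-product attention: a softmax-weighted average of memory rows."""
    scores = [sum(q * k for q, k in zip(query, row)) for row in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(dim)]


def deliberation_context(
    query: Vector,
    acoustic_enc: List[Vector],
    hypothesis_enc: List[Vector],
) -> Vector:
    """Attend to both sources and concatenate the two context vectors.

    Assumption: concatenation is used to combine the contexts; the
    source text only states that both sources are attended to.
    """
    acoustic_ctx = attend(query, acoustic_enc)
    hypothesis_ctx = attend(query, hypothesis_enc)
    return acoustic_ctx + hypothesis_ctx


# Toy inputs: two acoustic frames and one encoded hypothesis token.
ctx = deliberation_context(
    query=[1.0, 0.0],
    acoustic_enc=[[1.0, 0.0], [0.0, 1.0]],
    hypothesis_enc=[[0.5, 0.5]],
)
print(len(ctx))
```

In this sketch, a rescoring-only model would correspond to using just the acoustic context, and a text-only correction model to using just the hypothesis context; the deliberation decoder sees both.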

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing