On the limit of English conversational speech recognition

Zoltán Tüske; George Saon; Brian Kingsbury

arXiv:2105.00982 · cs.CL · May 4, 2021

TL;DR

This paper improves attention encoder-decoder models for conversational speech recognition and their integration with external language models. It reports 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with LSTM models, and also evaluates the conformer and self-attention based language models.

Contribution

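The paper applies an improved optimizer, speaker vector embeddings, and alternative speech representations to an LSTM system, uses the probability ratio approach to integrate an external language model more efficiently, and compares the LSTM with the conformer and with more advanced self-attention based language models.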

Findings

01. Reduced WER on Switchboard-300 by 4% relative
02. Achieved 5.9% and 11.5% WER on SWB and CHM parts of Hub5'00
03. Reached near the limit of the Switchboard-300 benchmark
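
The findings are stated in terms of word error rate (WER) and relative reduction. As a rough illustration of how these two quantities are usually computed, here is a minimal sketch. It assumes the standard word-level edit-distance definition of WER; the function names and the example sentences are hypothetical and are not taken from the paper, and official Hub5'00 scoring applies text normalization that this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far (before the current word) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed, e.g. 0.04 for '4% relative'."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical sentences: one deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Hypothetical WER values chosen only to illustrate a 4% relative change.
    print(relative_reduction(10.0, 9.6))
```

A relative reduction is measured against the baseline error rate, so a 4% relative improvement is a much smaller absolute change than four WER points.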

Abstract

In our previous work, we demonstrated that a single-headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition. In this paper, we further improve the results for both Switchboard 300 and 2000. Through the use of an improved optimizer, speaker vector embeddings, and alternative speech representations, we reduce the recognition errors of our LSTM system on Switchboard-300 by 4% relative. Compensation of the decoder model with the probability ratio approach allows more efficient integration of an external language model, and we report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models. Our study also considers the recently proposed conformer, and more advanced self-attention based language models. Overall, the conformer shows similar performance to the LSTM; nevertheless, their combination and decoding…
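
The abstract does not spell out the probability ratio formula. The sketch below shows one common way such a compensation is written, as an assumption rather than the authors' exact method: the encoder-decoder score is corrected by subtracting a weighted score from a "source" language model that approximates the decoder's implicit language model, and a weighted external language model score is added. All names, weights, and log-probabilities here are hypothetical.

```python
from typing import List, Tuple

# (text, log p_model(y|x), log p_source_lm(y), log p_external_lm(y))
Hypothesis = Tuple[str, float, float, float]


def fused_score(log_p_model: float,
                log_p_source_lm: float,
                log_p_external_lm: float,
                source_weight: float = 0.3,
                external_weight: float = 0.5) -> float:
    """Assumed ratio-style combination in the log domain.

    The source LM term is subtracted to compensate for the language model
    implicitly learned by the decoder; the external LM term is added.
    The weights are illustrative and would normally be tuned on held-out data.
    """
    return (log_p_model
            - source_weight * log_p_source_lm
            + external_weight * log_p_external_lm)


def rerank(hypotheses: List[Hypothesis],
           source_weight: float = 0.3,
           external_weight: float = 0.5) -> List[Hypothesis]:
    """Sort an n-best list by the fused score, best hypothesis first."""
    return sorted(
        hypotheses,
        key=lambda h: fused_score(h[1], h[2], h[3], source_weight, external_weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical two-entry n-best list; the scores are made up.
    nbest = [
        ("hypothesis a", -1.0, -2.0, -6.0),
        ("hypothesis b", -1.2, -2.0, -1.0),
    ]
    for text, *_ in rerank(nbest):
        print(text)
```

In this toy list the encoder-decoder alone prefers the first hypothesis, while the fused score prefers the second because the external language model rates it much higher.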

Taxonomy

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory