The Microsoft 2017 Conversational Speech Recognition System

W. Xiong; L. Wu; F. Alleva; J. Droppo; X. Huang; A. Stolcke

arXiv:1708.06073·cs.CL·February 28, 2022

TL;DR

This paper presents the 2017 version of Microsoft's conversational speech recognition system, which updates the 2016 system with newer neural-network-based acoustic and language models and reaches a 5.1% word error rate on the 2000 Switchboard evaluation set.

Contribution

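The system adds a CNN-BLSTM acoustic model to the previously combined architectures and includes character-based and dialog-session-aware LSTM language models in rescoring. System combination proceeds in two stages (senone/frame-level combination of acoustic model subsets, then word-level voting via confusion networks), followed by a confusion network rescoring step.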

Findings

01

Achieved a 5.1% word error rate (WER) on the 2000 Switchboard evaluation set

02

Integrated CNN-BLSTM acoustic model

03

Added a confusion network rescoring step after system combination
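
For reference, word error rate is the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The Python sketch below illustrates that definition only; the function name `word_error_rate` is hypothetical, and benchmark scoring typically also applies text normalization that is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 5.1% WER corresponds to roughly one word error per twenty reference words.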

Abstract

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by a word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
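
The two-stage combination described in the abstract can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: stage one is assumed to be a weighted average of per-frame senone log posteriors followed by renormalization, and stage two is assumed to receive confusion networks whose slots are already aligned across systems, with `<eps>` marking an empty slot. The function names, the equal default weights, and the data layout are all hypothetical.

```python
import math
from collections import defaultdict


def combine_frame_level(log_posteriors, weights=None):
    """Stage 1 (assumed form): average senone log posteriors per frame.

    log_posteriors[m][t][s] is the log posterior of senone s at frame t
    from acoustic model m. Returns renormalized log posteriors [t][s].
    """
    n_models = len(log_posteriors)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # assumption: equal weights
    combined = []
    for t in range(len(log_posteriors[0])):
        n_senones = len(log_posteriors[0][t])
        scores = [
            sum(w * log_posteriors[m][t][s] for m, w in enumerate(weights))
            for s in range(n_senones)
        ]
        # Renormalize with log-sum-exp so each frame is a distribution again.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in scores))
        combined.append([x - log_norm for x in scores])
    return combined


def vote_word_level(confusion_networks, weights=None, epsilon="<eps>"):
    """Stage 2 (assumed form): weighted posterior voting over aligned slots.

    confusion_networks[k][i] maps each candidate word in slot i of system k
    to its posterior. Slot alignment across systems is assumed to be done.
    """
    n_systems = len(confusion_networks)
    if weights is None:
        weights = [1.0 / n_systems] * n_systems  # assumption: equal weights
    words = []
    for i in range(len(confusion_networks[0])):
        votes = defaultdict(float)
        for k, w in enumerate(weights):
            for word, posterior in confusion_networks[k][i].items():
                votes[word] += w * posterior
        best = max(votes, key=votes.get)
        if best != epsilon:  # an empty slot contributes no word
            words.append(best)
    return words
```

In this sketch, three systems that disagree on a slot (for example two favoring "see" and one favoring "sea") yield the word with the highest summed posterior. Aligning the confusion networks and choosing the system weights are outside its scope.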

Taxonomy

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory