The IBM 2016 English Conversational Telephone Speech Recognition System

George Saon; Tom Sercu; Steven Rennie; Hong-Kwang J. Kuo

arXiv:1604.08242·cs.CL·June 23, 2016

TL;DR

This paper describes an English conversational telephone speech recognition system that reaches a record-low word error rate of 6.6% on the Switchboard subset of the Hub5 2000 evaluation test set. It does so by fusing the scores of three acoustic models and pairing them with improved language models.

Contribution

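It fuses the scores of three acoustic models (recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional LSTM nets) and combines them with an updated model "M" and hierarchical neural network language models, lowering the system's word error rate to 6.6%.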

Findings

01

Achieved a 6.6% word error rate on the Switchboard subset of the Hub5 2000 evaluation test set

02

Demonstrated the effectiveness of model fusion and hierarchical neural LMs

03

Reported 6.6% as a record-low word error rate on this benchmark
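
For readers unfamiliar with the metric behind these findings, word error rate is conventionally the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It is an illustration only: the function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and this is not the scoring pipeline used to produce the paper's numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes plain whitespace tokenization, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences: one substitution and one deletion over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A 6.6% word error rate corresponds to roughly one word error for every fifteen reference words.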

Abstract

We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model "M" and hierarchical neural network LMs.
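
The abstract does not spell out how the three acoustic models are fused. One common form of score fusion is a weighted sum of the per-state log scores that each model assigns to a frame, and the sketch below illustrates that form only. The function name `fuse_scores`, the weights, and the score values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_scores(model_scores: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-state log scores for a single frame.

    model_scores[m][k] is the log score model m gives to state k.
    Assumption: fusion is a linear combination in the log domain.
    """
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    n_states = len(model_scores[0])
    if any(len(scores) != n_states for scores in model_scores):
        raise ValueError("all models must score the same set of states")
    return [
        sum(w * scores[k] for w, scores in zip(weights, model_scores))
        for k in range(n_states)
    ]


# Hypothetical log scores for one frame over three states.
rnn_scores = [-1.0, -2.0, -3.0]
cnn_scores = [-2.0, -1.0, -3.0]
lstm_scores = [-1.5, -1.5, -3.0]

fused = fuse_scores([rnn_scores, cnn_scores, lstm_scores], [0.5, 0.25, 0.25])
best_state = max(range(len(fused)), key=fused.__getitem__)
```

In a sketch like this the weights would be tuned on held-out data. The decoder then consumes the fused scores in place of any single model's scores.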

Taxonomy

Methods: Maxout