The IBM 2016 English Conversational Telephone Speech Recognition System
George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo

TL;DR
This paper describes an English conversational telephone speech recognition system that reaches a record-low word error rate (WER) of 6.6% on the Switchboard subset of the Hub5 2000 evaluation test set. It does so by fusing the scores of three acoustic models and by using an updated model "M" together with hierarchical neural network language models.
Contribution
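It combines three acoustic models by score fusion: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional LSTM nets operating on FMLLR and i-vector features. On the language modeling side, it adds an updated model "M" and hierarchical neural network LMs. Together these lower the system's word error rate.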
Findings
Achieved a 6.6% word error rate on the Switchboard subset of the Hub5 2000 evaluation test set
Demonstrated the effectiveness of acoustic score fusion and hierarchical neural network LMs
Reported the 6.6% figure as a record for this conversational telephone speech test set
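For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch under that standard definition; the function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real scoring pipelines usually apply text normalization first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 6.6% corresponds to roughly 6.6 word errors per 100 reference words.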
Abstract
We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model "M" and hierarchical neural network LMs.
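The abstract does not spell out how the three acoustic models are fused. One common form of score fusion is a weighted sum of per-frame log-scores over a shared set of output states, sketched below. The function `fuse_scores`, the weights, and the toy scores are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List, Sequence

# Scores for one model: a list of frames, each a list of per-state log-scores.
FrameScores = List[List[float]]


def fuse_scores(model_scores: Sequence[FrameScores],
                weights: Sequence[float]) -> FrameScores:
    """Weighted sum of per-frame, per-state log-scores across models.

    Assumption: all models score the same frames and the same state set.
    """
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    num_frames = len(model_scores[0])
    num_states = len(model_scores[0][0])

    fused = [[0.0] * num_states for _ in range(num_frames)]
    for scores, weight in zip(model_scores, weights):
        if len(scores) != num_frames:
            raise ValueError("models disagree on the number of frames")
        for t, frame in enumerate(scores):
            if len(frame) != num_states:
                raise ValueError("models disagree on the number of states")
            for s, value in enumerate(frame):
                fused[t][s] += weight * value
    return fused


if __name__ == "__main__":
    # Toy log-scores: one frame, two states, three hypothetical models.
    rnn = [[-1.0, -2.0]]
    cnn = [[-3.0, -1.0]]
    lstm = [[-2.0, -3.0]]
    print(fuse_scores([rnn, cnn, lstm], [0.5, 0.25, 0.25]))
```

In a sketch like this, the fusion weights would typically be tuned on held-out data, and the fused scores would replace a single model's scores during decoding.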
Taxonomy
Methods: Maxout
