# English Conversational Telephone Speech Recognition by Humans and   Machines

**Authors:** George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel, Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael, Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall

arXiv: 1703.02136 · 2017-03-08

## TL;DR

This paper evaluates human and machine speech recognition performance on conversational telephone speech, demonstrating that advanced models have approached but not yet matched human accuracy, and introduces novel acoustic and language modeling techniques achieving new system performance milestones.

## Contribution

The paper provides new measurements of human performance on conversational speech and presents a speech recognition system with state-of-the-art accuracy using innovative acoustic and language models.

## Key findings

- Human performance may be better than previously reported.
- The system achieved 5.5%/10.3% error rates on Switchboard/CallHome.
- Novel fusion of multiple neural network models improved recognition accuracy.

## Abstract

One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to the level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which - at least at the writing of this paper - is a new performance milestone (albeit not at what we measure to be human performance!). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.02136/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1703.02136/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/1703.02136/full.md

---
Source: https://tomesphere.com/paper/1703.02136