Achieving Human Parity in Conversational Speech Recognition
W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig

TL;DR
This paper measures the error rate of professional transcribers on the NIST 2000 conversational speech test set and shows that an automated recognition system matches it, edging slightly ahead on both the Switchboard and CallHome portions.
Contribution
The paper establishes a human error-rate benchmark on the NIST 2000 test set and introduces a speech recognition system that reaches parity with it. The system combines various convolutional and LSTM acoustic model architectures with a novel spatial smoothing method and lattice-free MMI.
Findings
Automated system error rates: 5.8% on Switchboard, 11.0% on CallHome.
Human transcriber error rates: 5.9% on Switchboard, 11.3% on CallHome.
The system edges past the human benchmark on both portions, by 0.1 points on Switchboard and 0.3 points on CallHome, and sets a new state of the art.
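As a rough illustration of how an error rate of this kind is computed, the sketch below calculates a word error rate: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. It assumes the figures above are word error rates and uses plain whitespace tokenization; the function name `word_error_rate` and the example sentences are hypothetical, and official scoring applies text normalization rules that are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

Because insertions count as errors but the denominator is the reference length, this measure can exceed 100% for a hypothesis much longer than the reference.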
Abstract
Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively. The key to our system's performance is the use of various convolutional and LSTM acoustic model architectures, combined with a novel spatial smoothing method and lattice-free MMI…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
