Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve

TL;DR
This paper introduces Wav2Letter, an end-to-end convolutional neural network for speech recognition that simplifies training. It reports competitive word error rates on Librispeech with MFCC features and promising results from raw waveform input.
Contribution
It presents an end-to-end speech recognition model that outputs letters, trained with a new automatic segmentation criterion that removes the need for forced alignment of phonemes.
Findings
Competitive word error rate on Librispeech with MFCC features
Promising results using raw waveform input
Automatic segmentation criterion that is on par with CTC while being simpler
Abstract
This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from raw waveform.
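The abstract reports results as word error rate (WER). As a minimal sketch of how that metric is commonly computed (word-level edit distance divided by the number of reference words), the following may help. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; no text normalization
    (casing, punctuation) is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" involves one substitution and one deletion over six reference words, so the function returns 2/6.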
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
