Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

Ronan Collobert; Christian Puhrsch; Gabriel Synnaeve

arXiv:1609.03193·cs.LG·September 14, 2016·248 cites

TL;DR

This paper introduces Wav2Letter, an end-to-end convolutional speech recognition system trained to output letters directly from transcribed speech. It reports competitive word error rates on Librispeech with MFCC features and promising results from raw waveform input.

Contribution

The paper presents an end-to-end speech recognition model together with an automatic segmentation criterion that trains from unaligned transcriptions, removing the need for forced alignment of phonemes.

Findings

01 Competitive word error rate on Librispeech with MFCC features

02 Promising results from raw waveform input

03 An automatic segmentation criterion that is on par with CTC while being simpler

Abstract

This paper presents a simple end-to-end model for speech recognition, combining a convolutional-network-based acoustic model with a graph decoder. It is trained on transcribed speech to output letters, without the need for forced alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from raw waveform.
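The abstract reports its results as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring code used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.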

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques