End-to-End Speech Recognition From the Raw Waveform

Neil Zeghidour; Nicolas Usunier; Gabriel Synnaeve; Ronan Collobert; Emmanuel Dupoux

arXiv:1806.07098 · cs.CL · June 22, 2018

TL;DR

This paper demonstrates that end-to-end speech recognition models trained directly from raw waveforms with trainable convolutional front-ends outperform traditional mel-filterbank features on large vocabulary tasks, simplifying the pipeline.

Contribution

It systematically compares two trainable convolutional front-ends, one inspired by gammatone filterbanks and one by the scattering transform, against mel-filterbanks, and proposes modifications that improve speech recognition from the raw waveform.

Findings

01

Trainable filterbanks outperform mel-filterbanks in word error rate.

02

Adding an instance normalization layer greatly improves the gammatone-based filterbanks and speeds up training of the scattering-based ones.

03

First demonstration of raw waveform end-to-end models surpassing mel-filterbanks on large datasets.

Abstract

State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second one by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to…
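To make the instance normalization step concrete, here is a minimal sketch of the general technique: each channel of a single utterance is normalized to zero mean and unit variance over time. The function name `instance_norm`, the `eps` value, the list-of-lists layout, and the absence of learnable scale and shift parameters are all assumptions made for illustration; the excerpt above does not specify where the layer sits in the front-end or how it is configured.

```python
import math

def instance_norm(features, eps=1e-5):
    """Normalize each channel of ONE utterance over its time axis.

    features: list of channels, each a list of per-frame values
              (hypothetical layout: channels x frames).
    eps:      small constant for numerical stability (assumed value).
    """
    normalized = []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([(x - mean) * scale for x in channel])
    return normalized

# Toy stand-in for filterbank output: 2 channels x 4 frames
# with very different dynamic ranges.
toy = [[1.0, 2.0, 3.0, 4.0], [100.0, 300.0, 200.0, 400.0]]
out = instance_norm(toy)
```

After normalization both channels share the same scale, regardless of how loud the utterance was or how large the filter responses were. The statistics are computed per utterance and per channel, not across a batch.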

Code & Models

Repositories

renyuanL/ry-Speech-commands
tf

Taxonomy

Methods: Instance Normalization