Learning Filterbanks from Raw Speech for Phone Recognition

Neil Zeghidour; Nicolas Usunier; Iasonas Kokkinos; Thomas Schatz; Gabriel Synnaeve; Emmanuel Dupoux

arXiv:1711.01161 · cs.CL · April 5, 2018

TL;DR

This paper introduces learnable time-domain filterbanks (TD-filterbanks) that operate on raw speech and are trained jointly with a convolutional network. On TIMIT phone recognition, they consistently outperform comparable mel-filterbanks.

Contribution

It proposes an end-to-end trainable time-domain filterbank that is initialized as an approximation of mel-filterbanks and then fine-tuned jointly with the network, improving phone recognition performance.

Findings

01

TD-filterbanks consistently outperform comparable mel-filterbanks on TIMIT across several architectures

02

Learning all front-end steps, from pre-emphasis up to averaging, yields the best results

03

Filters develop asymmetric impulse responses at convergence, and some remain almost analytic
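
The third finding refers to two properties of a complex filter that can be checked numerically. The sketch below is one hypothetical way to do so and is not taken from the paper: `negative_frequency_ratio` and `envelope_asymmetry` are illustrative names, and the specific measures are assumptions. An analytic filter has no energy at negative frequencies, so a ratio near 0 means "almost analytic", while a real-valued filter gives a ratio near 0.5.

```python
import numpy as np

def negative_frequency_ratio(h):
    """Fraction of a filter's spectral energy that lies at negative frequencies.

    Hypothetical diagnostic: close to 0 for an (almost) analytic filter,
    close to 0.5 for a real-valued filter.
    """
    power = np.abs(np.fft.fft(h)) ** 2
    freqs = np.fft.fftfreq(len(h))
    return power[freqs < 0].sum() / power.sum()

def envelope_asymmetry(h):
    """Relative distance between the envelope |h| and its time-reversed copy.

    Hypothetical diagnostic: 0 for a symmetric impulse response.
    """
    env = np.abs(h)
    return np.linalg.norm(env - env[::-1]) / np.linalg.norm(env)

# A symmetric complex Gabor wavelet: Gaussian envelope times a complex carrier.
n = np.arange(401) - 200                      # sample index, centred on 0
gabor = np.exp(-0.5 * (n / 40.0) ** 2) * np.exp(2j * np.pi * 0.1 * n)

print(negative_frequency_ratio(gabor))        # near 0: almost analytic
print(negative_frequency_ratio(gabor.real))   # near 0.5: real-valued filter
print(envelope_asymmetry(gabor))              # 0: symmetric envelope
```

Applied to learned filters, the same two numbers would give a rough per-filter summary of how far training has moved each filter from its initialization.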

Abstract

We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
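
The pipeline described in the abstract can be sketched in NumPy as a fixed forward pass: complex filters applied to the waveform, a squared modulus, low-pass averaging, and log compression. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the Gabor-wavelet initialization, the bandwidth-to-width rule, the squared Hanning low-pass, and the defaults (40 filters, 400-sample width, 160-sample hop, 16 kHz) are all choices made for this sketch. Pre-emphasis is omitted, and in the paper the filters are trainable parameters rather than fixed arrays.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def gabor_init(n_filters=40, width=400, sample_rate=16000):
    """Complex Gabor wavelets with centre frequencies evenly spaced in mel."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    centers = hz_edges[1:-1]
    bandwidths = hz_edges[2:] - hz_edges[:-2]   # base of each mel triangle, in Hz
    # Assumption: Gaussian width in time is inversely proportional to bandwidth.
    sigma_t = 2.0 / (np.pi * bandwidths)
    t = (np.arange(width) - width // 2) / sample_rate
    envelope = np.exp(-0.5 * (t[None, :] / sigma_t[:, None]) ** 2)
    carrier = np.exp(2j * np.pi * centers[:, None] * t[None, :])
    filters = envelope * carrier
    filters /= np.linalg.norm(filters, axis=1, keepdims=True)
    return filters, centers

def td_filterbank(waveform, filters, pool_width=400, hop=160):
    """Filter, take the squared modulus, low-pass and decimate, then log-compress."""
    lowpass = np.hanning(pool_width) ** 2       # assumed averaging window
    lowpass /= lowpass.sum()
    channels = []
    for h in filters:
        response = np.convolve(waveform, h, mode="same")   # complex output
        energy = np.abs(response) ** 2
        pooled = np.convolve(energy, lowpass, mode="valid")[::hop]
        channels.append(np.log1p(pooled))       # log(1 + x) keeps values finite
    return np.stack(channels)                   # shape: (n_filters, n_frames)

filters, centers = gabor_init()
sample_rate = 16000
time = np.arange(sample_rate) / sample_rate
tone = np.cos(2 * np.pi * centers[20] * time)   # one second at filter 20's centre
features = td_filterbank(tone, filters)
print(features.shape)
```

Because every step is a convolution or a pointwise operation, the same structure maps directly onto convolutional layers, which is what allows the filters to be fine-tuned jointly with the rest of the network.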
