TL;DR
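This paper introduces time-domain filterbanks (TD-filterbanks), a learnable front-end for raw speech that is initialized as an approximation of mel-filterbanks and trained jointly with a convolutional network. On TIMIT phone recognition, it consistently outperforms comparable mel-filterbanks across several architectures.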
Contribution
It proposes an end-to-end trainable bank of complex filters that operates on the raw waveform, is initialized as an approximation of mel-filterbanks, and is fine-tuned jointly with the convolutional network to improve phone recognition.
Findings
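TD-filterbanks consistently outperform comparable mel-filterbanks on TIMIT phone recognition across several architectures
Learning all front-end steps, from pre-emphasis up to averaging, yields the best results
At convergence, the filters have asymmetric impulse responses, and some remain almost analytic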
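The last finding concerns two properties of a complex filter: whether its impulse response is symmetric in time, and whether it is analytic (no energy at negative frequencies). The sketch below shows one way these could be quantified. Both metrics, the function names, and the two example filters are illustrative assumptions, not the measures used in the paper.

```python
import numpy as np

def negative_frequency_ratio(h, n_fft=1024):
    """Fraction of spectral energy at negative frequencies (near 0 if analytic)."""
    spec = np.abs(np.fft.fft(h, n_fft)) ** 2
    # FFT bins above n_fft // 2 correspond to negative frequencies.
    return spec[n_fft // 2 + 1:].sum() / spec.sum()

def envelope_asymmetry(h):
    """Relative distance between |h| and its time reversal (0 if symmetric)."""
    env = np.abs(h)
    return np.linalg.norm(env - env[::-1]) / np.linalg.norm(env)

# Hypothetical example filters: 400 taps, 1 kHz carrier at a 16 kHz sampling rate.
t = np.arange(400) - 199.5
gabor = np.exp(-0.5 * (t / 40.0) ** 2) * np.exp(2j * np.pi * 1000.0 * t / 16000.0)

n = np.arange(400)
damped = n * np.exp(-n / 30.0) * np.exp(2j * np.pi * 1000.0 * n / 16000.0)

for name, h in [("gabor", gabor), ("gabor.real", gabor.real), ("damped", damped)]:
    print(name, negative_frequency_ratio(h), envelope_asymmetry(h))
```

A symmetric complex Gabor filter scores near zero on both measures. Its real part puts about half of its energy at negative frequencies, and the one-sided damped filter has a clearly asymmetric envelope.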
Abstract
We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
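A minimal NumPy sketch of the kind of front-end the abstract describes: pre-emphasis, a bank of complex filters applied to the raw waveform, squared modulus, low-pass averaging, and log compression. The specifics are assumptions made for illustration and are not given in the text above: Gabor-shaped filters with mel-spaced center frequencies, 40 filters of 400 taps, a stride of 160 samples at 16 kHz, a Hann averaging window, a pre-emphasis coefficient of 0.97, and log(1 + x) compression. In the paper's setting these operations would be trainable layers; here they are fixed at their initial values.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def gabor_bank(n_filters=40, width=400, sr=16000):
    """Complex Gabor filters with mel-spaced centers (assumed initialization)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    centers_hz = mel_to_hz(mel_pts[1:-1])
    # Assumed bandwidth: half the distance between the two neighbouring mel points.
    bw_hz = (mel_to_hz(mel_pts[2:]) - mel_to_hz(mel_pts[:-2])) / 2.0
    sigma = sr / (2.0 * np.pi * bw_hz)            # Gaussian std in samples
    t = np.arange(width) - (width - 1) / 2.0      # time axis centered on zero
    envelope = np.exp(-0.5 * (t[None, :] / sigma[:, None]) ** 2)
    carrier = np.exp(2j * np.pi * centers_hz[:, None] * t[None, :] / sr)
    bank = envelope * carrier
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)
    return bank, centers_hz

def td_filterbank(wave, bank, stride=160, preemph=0.97):
    """Pre-emphasis -> complex filtering -> |.|^2 -> Hann averaging -> log(1 + x)."""
    x = np.concatenate(([wave[0]], wave[1:] - preemph * wave[:-1]))
    sub = np.stack([np.convolve(x, h, mode="valid") for h in bank])
    energy = np.abs(sub) ** 2
    lowpass = np.hanning(bank.shape[1])
    lowpass /= lowpass.sum()
    # Smooth each band, then keep one value every `stride` samples.
    frames = np.stack([np.convolve(e, lowpass, mode="valid")[::stride] for e in energy])
    return np.log1p(frames)

bank, centers_hz = gabor_bank()
wave = np.random.default_rng(0).standard_normal(16000)  # 1 s of noise as a stand-in
feats = td_filterbank(wave, bank)
print(feats.shape)  # (number of filters, number of frames)
```

Every step is a convolution or a pointwise operation, which is what allows such a front-end to be expressed as network layers and fine-tuned jointly with the recognizer.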
