End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks
Dimitri Palaz, Ronan Collobert, Mathew Magimai.-Doss

TL;DR
This paper demonstrates that convolutional neural networks can learn phoneme sequences directly from raw speech signals, achieving performance comparable to traditional MFCC-based systems and thus reducing reliance on hand-crafted features.
Contribution
It introduces an end-to-end CNN approach for raw speech phoneme recognition, challenging the necessity of complex feature extraction.
Findings
Performance comparable to MFCC-based systems on the TIMIT and WSJ datasets
CNN can learn directly from raw signals
Reduces need for hand-crafted features
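To make "learning directly from raw signals" concrete, here is a minimal sketch of the general idea: a bank of 1-D convolution kernels slides over the waveform samples, the outputs are max-pooled over time, and a linear layer with softmax gives per-frame class posteriors. This is not the authors' architecture. The kernel width, stride, pooling size, number of filters, number of classes, and the function names are all hypothetical, the weights are random rather than trained, and the sequence-level part of the system is not covered.

```python
import math
import random

def conv1d(signal, kernels, stride):
    """Valid 1-D convolution of a raw waveform with a bank of kernels.
    Returns a list of frames, each holding one activation per kernel."""
    width = len(kernels[0])
    frames = []
    for start in range(0, len(signal) - width + 1, stride):
        window = signal[start:start + width]
        frames.append([sum(w * x for w, x in zip(k, window)) for k in kernels])
    return frames

def max_pool(frames, size):
    """Non-overlapping max pooling over time, applied per channel."""
    pooled = []
    for start in range(0, len(frames) - size + 1, size):
        block = frames[start:start + size]
        pooled.append([max(channel) for channel in zip(*block)])
    return pooled

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frame_posteriors(signal, kernels, out_weights, stride=10, pool=3):
    """Raw samples -> convolution -> max pooling -> tanh -> linear -> softmax."""
    feats = max_pool(conv1d(signal, kernels, stride), pool)
    feats = [[math.tanh(v) for v in frame] for frame in feats]
    return [
        softmax([sum(w * v for w, v in zip(row, frame)) for row in out_weights])
        for frame in feats
    ]

# Toy input: 0.1 s of a 440 Hz tone sampled at 16 kHz (assumed values).
rng = random.Random(0)
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
# 4 hypothetical filters of width 30 samples, 3 hypothetical output classes.
kernels = [[rng.uniform(-0.1, 0.1) for _ in range(30)] for _ in range(4)]
out_weights = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(3)]

posteriors = frame_posteriors(signal, kernels, out_weights)
print(len(posteriors), "frames,", len(posteriors[0]), "classes per frame")
```

In a trained system the kernels would be learned jointly with the classifier, which is what removes the separate hand-crafted feature extraction step.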
Abstract
Most state-of-the-art phoneme recognition systems rely on classical neural network classifiers fed with highly tuned features, such as MFCC or PLP features. Recent advances in "deep learning" approaches have called such systems into question, but while some attempts have been made with simpler features such as spectrograms, state-of-the-art systems still rely on MFCCs. This might be viewed as a kind of failure of deep learning approaches, which are often claimed to be able to train on raw signals, alleviating the need for hand-crafted features. In this paper, we investigate a convolutional neural network approach for raw speech signals. While convolutional architectures have had tremendous success in computer vision and text processing, they seem to have been neglected in recent years in the speech processing field. We show that it is possible to learn an end-to-end phoneme sequence…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
