Speech and Speaker Recognition from Raw Waveform with SincNet
Mirco Ravanelli, Yoshua Bengio

TL;DR
This paper introduces SincNet, a CNN architecture that processes raw audio waveforms for speech and speaker recognition. Its first layer is built from parametrized sinc functions, which constrains it to learn meaningful band-pass filters.
Contribution
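SincNet is a novel CNN whose first layer learns only the low and high cutoff frequencies of band-pass filters directly from data, rather than every element of each filter. Compared with standard CNNs on speech tasks, this improves convergence speed, recognition accuracy, and computational efficiency.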
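To make the idea of a parametrized band-pass filter concrete, here is a minimal NumPy sketch, not the authors' implementation. It builds each first-layer kernel as the difference of two windowed sinc low-pass filters, so every kernel is fully determined by two numbers. The function names, the kernel length of 251, the 16 kHz sample rate, the Hamming window, and the "low cutoff plus bandwidth" parametrization are all illustrative assumptions, and the cutoffs are fixed here rather than trained by backpropagation.

```python
import numpy as np

def sinc_bandpass(f1_hz, f2_hz, kernel_size=251, sample_rate=16000):
    """Build one windowed band-pass FIR kernel from two cutoff frequencies.

    g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n), with sinc(x) = sin(x)/x
    and f1, f2 expressed in cycles per sample.
    """
    f1 = f1_hz / sample_rate
    f2 = f2_hz / sample_rate
    # Sample index centred on zero so the kernel is symmetric (linear phase).
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # np.sinc(x) is sin(pi*x)/(pi*x), so sinc(2*pi*f*n) equals np.sinc(2*f*n).
    g = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Assumed window choice: tapering reduces ripple caused by truncating the sinc.
    return g * np.hamming(kernel_size)

def sinc_filterbank(low_hz, band_hz, kernel_size=251, sample_rate=16000):
    """Stack one kernel per (low cutoff, bandwidth) pair.

    Using |bandwidth| keeps the high cutoff at or above the low cutoff
    (a hypothetical parametrization chosen for this sketch).
    """
    return np.stack([
        sinc_bandpass(lo, lo + abs(bw), kernel_size, sample_rate)
        for lo, bw in zip(low_hz, band_hz)
    ])

# Three hypothetical filters; in a trained model these values would be learned.
low_hz = np.array([100.0, 1000.0, 3000.0])
band_hz = np.array([400.0, 1000.0, 2000.0])
bank = sinc_filterbank(low_hz, band_hz)
print(bank.shape)  # (3, 251)
print(low_hz.size + band_hz.size, "parameters define", bank.size, "filter taps")
```

In this sketch, three filters are described by 6 numbers, whereas an unconstrained convolutional layer with the same kernels would treat each filter tap as a separate parameter.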
Findings
Faster convergence compared to standard CNNs
Improved recognition performance
More computationally efficient model
Abstract
Deep neural networks can learn complex and abstract representations that are progressively obtained by combining simpler ones. A recent trend in speech and speaker recognition consists of discovering these representations directly from raw audio samples. Unlike standard hand-crafted features such as MFCCs or FBANK, the raw waveform can potentially help neural networks discover better and more customized representations. The high-dimensional raw inputs, however, can make training significantly more challenging. This paper summarizes our recent efforts to develop a neural architecture that efficiently processes speech from audio waveforms. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
