Speaker Recognition from Raw Waveform with SincNet
Mirco Ravanelli, Yoshua Bengio

TL;DR
This paper introduces SincNet, a CNN architecture for speaker recognition from raw waveforms whose first layer learns band-pass filters; it converges faster and reaches higher accuracy than standard CNNs.
Contribution
SincNet constrains its first convolutional layer to parametrized sinc functions that implement band-pass filters, so only the low and high cutoff frequencies of each filter are learned rather than every filter coefficient.
Findings
SincNet converges faster than standard CNNs.
SincNet achieves higher accuracy in speaker identification and verification.
The filter bank is specifically tuned for speaker recognition tasks.
Abstract
Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have recently been obtained with Convolutional Neural Networks (CNNs) fed directly with raw speech samples. Rather than employing standard hand-crafted features, these CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, only low and high cutoff frequencies are…
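As a rough illustration of the idea in the abstract, a band-pass filter defined only by a low and a high cutoff can be built as the difference of two sinc low-pass filters. The NumPy sketch below is not the authors' code: the function name sinc_bandpass, the kernel length, the 16 kHz sample rate, and the Hamming window are assumptions chosen for illustration, and the cutoffs are fixed numbers here, whereas a trained network would treat them as learnable parameters.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=251, sample_rate=16000):
    """Band-pass FIR kernel built from two cutoffs (hypothetical helper).

    f_low and f_high are in Hz; kernel_size and sample_rate are
    illustrative defaults, not values taken from the summary above.
    """
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs normalised to cycles per sample.
    f1 = f_low / sample_rate
    f2 = f_high / sample_rate
    # np.sinc(x) is sin(pi*x)/(pi*x), so 2*f*np.sinc(2*f*n) is an ideal
    # low-pass filter with cutoff f; subtracting two of them leaves the
    # band between f1 and f2.
    band_pass = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A window (assumed Hamming here) smooths the truncation of the sinc.
    return band_pass * np.hamming(kernel_size)
```

Calling sinc_bandpass(300.0, 3400.0) returns a 251-tap kernel that passes roughly 300 to 3400 Hz. Whatever the kernel length, each filter is described by just two numbers, which is the contrast with a standard CNN filter that the abstract draws.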
