Detection of Doctored Speech: Towards an End-to-End Parametric Learn-able Filter Approach
Rohit Arora

TL;DR
This paper proposes an end-to-end deep learning approach using wavelet-based layers for detecting doctored speech, significantly improving spoof detection accuracy over traditional features and baseline models.
Contribution
It introduces wavelet scattering and continuous wavelet transform layers into deep neural networks for speech spoof detection, with a novel wavelet deconvolution layer for optimized feature extraction.
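As a rough illustration of what a wavelet layer with a tunable scale computes, the sketch below convolves a waveform with one wavelet per scale. It is not the paper's implementation: the Mexican hat (Ricker) mother wavelet, the helper names `ricker` and `cwt_layer`, and the `width` truncation parameter are all assumptions made for this example. In a trainable layer the scales would be parameters updated by backpropagation; here they are fixed NumPy values.

```python
import numpy as np

def ricker(scale, width=5.0):
    """Sampled Mexican hat wavelet at one scale (hypothetical helper).

    The kernel is truncated at +/- width * scale samples.
    """
    half = int(np.ceil(width * scale))
    t = np.arange(-half, half + 1)
    x = t / scale
    psi = (1.0 - x**2) * np.exp(-0.5 * x**2)
    # Divide by sqrt(scale) so kernels at different scales have comparable energy.
    return psi / np.sqrt(scale)

def cwt_layer(signal, scales):
    """Convolve the signal with one wavelet per scale.

    Returns an array of shape (n_scales, n_samples). Assumes the signal is
    longer than the widest kernel, so mode="same" keeps the signal length.
    """
    return np.stack([np.convolve(signal, ricker(s), mode="same") for s in scales])

# Example: a 512-sample sinusoid analysed at three assumed scales.
x = np.sin(2 * np.pi * 0.05 * np.arange(512))
features = cwt_layer(x, [1.0, 2.0, 4.0])
print(features.shape)
```

Each row of `features` is the response at one scale, so stacking rows gives a time-scale map that later network layers could consume.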
Findings
Wavelet-based models outperform traditional handcrafted features.
The Wavelet Deconvolution layer improves model performance.
Significant relative improvements on the ASVspoof 2019 dataset.
Abstract
Automatic Speaker Verification (ASV) systems have potential in biometric applications for logical access control and authentication. Much is at stake if an ASV system is compromised. This preliminary work presents a comparative analysis of the wavelet-based and MFCC-based state-of-the-art spoof detection techniques developed in (Novoselov et al., 2016) and (Alam et al., 2016a), respectively. The results on ASVspoof 2015 justify our preference for wavelet-based features over MFCC features. The experiments on the ASVspoof 2019 database show that traditional handcrafted features lack credibility and give us more reason to move towards end-to-end (E2E) deep neural networks and more recent techniques. We use the SincNet architecture as our baseline. We obtain E2E deep learning models, which we call WSTnet and CWTnet, respectively, by replacing the Sinc…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
