Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0
Taein Kang, Soyul Han, Sunmook Choi, Jaejin Seo, Sanghyeok Chung, Seungeun Lee, Seungsang Oh, Il-Youp Kwak

TL;DR
This study evaluates the effectiveness of wav2vec 2.0 as a raw speech feature extractor for voice spoofing detection, demonstrating that optimized configurations can outperform traditional handcrafted features on benchmark datasets.
Contribution
The paper presents a systematic analysis of wav2vec 2.0 layer selection and fine-tuning strategies for spoofing detection, achieving state-of-the-art results.
Findings
Wav2vec 2.0 features can surpass handcrafted features in spoofing detection.
Layer selection and fine-tuning significantly impact detection performance.
Optimal configurations achieve state-of-the-art results on the ASVspoof 2019 LA dataset.
Abstract
Conventional spoofing detection systems have heavily relied on the use of handcrafted features derived from speech data. However, a notable shift has recently emerged towards the direct utilization of raw speech waveforms, as demonstrated by methods like SincNet filters. This shift underscores the demand for more sophisticated audio sample features. Moreover, the success of deep learning models, particularly those utilizing large pretrained wav2vec 2.0 as a featurization front-end, highlights the importance of refined feature encoders. In response, this research assessed the representational capability of wav2vec 2.0 as an audio feature extractor, modifying the size of its pretrained Transformer layers through two key adjustments: (1) selecting a subset of layers starting from the leftmost one and (2) fine-tuning a portion of the selected layers from the rightmost one. We complemented…
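The two adjustments described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `plan_layers`, the `LayerPlan` container, and the numbers in the usage line (a 24-layer stack, 12 layers kept, 4 fine-tuned) are assumptions chosen for illustration. The sketch only computes which layer indices would be kept, frozen, or fine-tuned; applying the plan to an actual model is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LayerPlan:
    """Hypothetical container describing how a Transformer stack is used."""
    kept: list[int]       # indices of retained layers (0 = leftmost layer)
    frozen: list[int]     # retained layers whose weights stay fixed
    trainable: list[int]  # retained layers that are fine-tuned


def plan_layers(total_layers: int, n_keep: int, n_finetune: int) -> LayerPlan:
    """Sketch of the two adjustments:
    (1) keep the first n_keep layers, counting from the leftmost one;
    (2) fine-tune the last n_finetune of those kept layers, counting
        from the rightmost kept layer; the remaining kept layers are frozen.
    """
    if not 0 < n_keep <= total_layers:
        raise ValueError("n_keep must be between 1 and total_layers")
    if not 0 <= n_finetune <= n_keep:
        raise ValueError("n_finetune must be between 0 and n_keep")

    kept = list(range(n_keep))       # layers after index n_keep - 1 are dropped
    split = n_keep - n_finetune      # first index that is fine-tuned
    return LayerPlan(kept=kept, frozen=kept[:split], trainable=kept[split:])


# Illustrative values only; they are not taken from the paper.
plan = plan_layers(total_layers=24, n_keep=12, n_finetune=4)
print(plan.frozen)     # layers 0..7 stay fixed
print(plan.trainable)  # layers 8..11 are fine-tuned
```

In a deep learning framework, a plan like this would typically be applied by truncating the encoder's layer list to `kept` and disabling gradient updates for the parameters of the layers in `frozen`.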
Taxonomy
Topics: Speech Recognition and Synthesis
