Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0

Taein Kang; Soyul Han; Sunmook Choi; Jaejin Seo; Sanghyeok Chung; Seungeun Lee; Seungsang Oh; Il-Youp Kwak

arXiv:2402.17127 · cs.SD · February 28, 2024 · 3 cites

Open Access

TL;DR

This study evaluates the effectiveness of wav2vec 2.0 as a raw speech feature extractor for voice spoofing detection, demonstrating that optimized configurations can outperform traditional handcrafted features on benchmark datasets.

Contribution

The paper presents a systematic analysis of wav2vec 2.0 layer selection and fine-tuning strategies for spoofing detection, reporting state-of-the-art results on the ASVspoof 2019 LA dataset.

Findings

01 Wav2vec 2.0 features can surpass handcrafted features in spoofing detection.

02 Layer selection and fine-tuning significantly impact detection performance.

03 Optimal configurations achieve state-of-the-art results on the ASVspoof 2019 LA dataset.

Abstract

Conventional spoofing detection systems have heavily relied on the use of handcrafted features derived from speech data. However, a notable shift has recently emerged towards the direct utilization of raw speech waveforms, as demonstrated by methods like SincNet filters. This shift underscores the demand for more sophisticated audio sample features. Moreover, the success of deep learning models, particularly those utilizing large pretrained wav2vec 2.0 as a featurization front-end, highlights the importance of refined feature encoders. In response, this research assessed the representational capability of wav2vec 2.0 as an audio feature extractor, modifying the size of its pretrained Transformer layers through two key adjustments: (1) selecting a subset of layers starting from the leftmost one and (2) fine-tuning a portion of the selected layers from the rightmost one. We complemented…
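The two adjustments described in the abstract can be illustrated with a small sketch. It is not the authors' code: the function name `plan_layers`, the `LayerPlan` container, and the example values (24 layers, keep 12, fine-tune 4) are hypothetical, and it assumes that "leftmost" means the Transformer layer closest to the input and "rightmost" means the last of the selected layers.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LayerPlan:
    """Hypothetical container describing how each Transformer layer is treated."""
    kept: Tuple[int, ...]       # layers retained, counted from the first (leftmost) layer
    frozen: Tuple[int, ...]     # retained layers whose weights stay fixed
    trainable: Tuple[int, ...]  # retained layers that are fine-tuned


def plan_layers(total_layers: int, n_keep: int, n_finetune: int) -> LayerPlan:
    """Keep the first n_keep layers, then fine-tune the last n_finetune of those.

    Assumption: layer 0 is the "leftmost" layer (nearest the input), and the
    "rightmost" selected layer is index n_keep - 1.
    """
    if not 0 < n_keep <= total_layers:
        raise ValueError("n_keep must be in 1..total_layers")
    if not 0 <= n_finetune <= n_keep:
        raise ValueError("n_finetune must be in 0..n_keep")

    kept = tuple(range(n_keep))   # adjustment (1): subset starting from the leftmost layer
    split = n_keep - n_finetune   # boundary between frozen and fine-tuned layers
    return LayerPlan(
        kept=kept,
        frozen=kept[:split],
        trainable=kept[split:],   # adjustment (2): fine-tune from the rightmost selected layer
    )


# Illustrative values only; they are not taken from the paper.
plan = plan_layers(total_layers=24, n_keep=12, n_finetune=4)
print("frozen:   ", plan.frozen)
print("trainable:", plan.trainable)
```

In a PyTorch model, a plan like this would typically be applied by dropping the layers that are not kept and setting `requires_grad = False` on the parameters of the frozen ones. How the selected layers' outputs feed the downstream classifier is not covered here.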

Taxonomy

Topics: Speech Recognition and Synthesis