Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection
Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang

TL;DR
This paper explores how hierarchical embeddings from a pre-trained speech model can be effectively merged using an attentive method to improve anti-spoofing detection accuracy, achieving state-of-the-art results.
Contribution
It introduces an attentive merging technique for multi-layer embeddings of WavLM, enhancing anti-spoofing detection performance and revealing the importance of early transformer layers.
Findings
Achieved EERs of 0.65%, 3.50%, and 3.19% on the ASVspoof 2019LA, 2021LA, and 2021DF evaluation sets, respectively.
Early hidden layers of WavLM are highly effective for anti-spoofing.
Partial pre-trained models can maintain high performance with improved efficiency.
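The equal error rate (EER) reported above is the operating point where the false acceptance rate (spoofed speech accepted) equals the false rejection rate (bona fide speech rejected). The sketch below shows one simple way to estimate it with a threshold sweep. It is an illustration only, not the official ASVspoof scoring tool: the function name `compute_eer` and the convention that higher scores mean "more likely bona fide" are assumptions made for this example.

```python
from typing import Sequence


def compute_eer(bonafide: Sequence[float], spoof: Sequence[float]) -> float:
    """Estimate the EER by sweeping every observed score as a threshold.

    Assumed convention: a trial is accepted as bona fide when score >= threshold.
    """
    thresholds = sorted(set(bonafide) | set(spoof))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= th for s in spoof) / len(spoof)
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(b < th for b in bonafide) / len(bonafide)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration, not from the paper).
separable = compute_eer(bonafide=[0.9, 0.8], spoof=[0.1, 0.2])
overlapping = compute_eer(bonafide=[0.9, 0.4], spoof=[0.6, 0.1])
```

With a finite set of scores the two error rates rarely cross exactly, so this sketch averages them at the closest point; production scoring tools typically interpolate instead.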
Abstract
Self-supervised learning (SSL) speech representation models, trained on large speech corpora, have demonstrated effectiveness in extracting hierarchical speech embeddings through multiple transformer layers. However, the behavior of these embeddings in specific tasks remains uncertain. This paper investigates the multi-layer behavior of the WavLM model in anti-spoofing and proposes an attentive merging method to leverage the hierarchical hidden embeddings. Results demonstrate the feasibility of fine-tuning WavLM to achieve the best equal error rate (EER) of 0.65%, 3.50%, and 3.19% on the ASVspoof 2019LA, 2021LA, and 2021DF evaluation sets, respectively. Notably, we find that the early hidden transformer layers of the WavLM large model contribute significantly to the anti-spoofing task, enabling computational efficiency by utilizing a partial pre-trained model.
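The general idea of merging hidden embeddings from several transformer layers with attention weights can be sketched as follows. This is a generic layer-attention illustration, not the paper's exact architecture, which the abstract does not specify. The scoring rule (time-averaged embedding dotted with a learnable vector `w`), the softmax over layers, and the names `attentive_merge` and `softmax` are all assumptions made for this example.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # one layer's output: T frames x D features


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_merge(layers: List[Matrix], w: List[float]) -> Tuple[Matrix, List[float]]:
    """Merge per-layer embeddings into one T x D matrix using layer attention."""
    num_frames, dim = len(layers[0]), len(layers[0][0])
    scores = []
    for layer in layers:
        # Summarize the layer by averaging its frames over time.
        mean = [sum(frame[d] for frame in layer) / num_frames for d in range(dim)]
        # Hypothetical scoring rule: dot product with a learnable vector w.
        scores.append(sum(m * wd for m, wd in zip(mean, w)))
    weights = softmax(scores)  # one attention weight per layer, summing to 1
    merged = [
        [sum(a * layer[t][d] for a, layer in zip(weights, layers)) for d in range(dim)]
        for t in range(num_frames)
    ]
    return merged, weights


# Two toy "layers", each with 1 frame and 2 features (made-up values).
layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
# A zero scoring vector gives uniform weights, i.e. a plain average of layers.
merged, weights = attentive_merge(layers, w=[0.0, 0.0])
```

Under this sketch, using a partial pre-trained model amounts to passing only the first few layers (for example `layers[:k]` for some hypothetical `k`), so the later transformer layers never need to be computed.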
Taxonomy
Topics: Linguistics and Cultural Studies · Speech Recognition and Synthesis · Infant Health and Development
