Frequency-mix Knowledge Distillation for Fake Speech Detection

Cunhang Fan; Shunbo Dong; Jun Xue; Yujie Chen; Jiangyan Yi; Zhao Lv

arXiv:2406.09664 · cs.SD · June 17, 2024

TL;DR

This paper introduces Frequency-mix knowledge distillation (FKD), a data augmentation and model training method that improves fake speech detection in telephony scenarios by combining frequency-domain and time-domain techniques, with a reported 31% improvement over the baseline on ASVspoof 2021 LA.

Contribution

The paper proposes a new Frequency-mix data augmentation method and a multi-level feature distillation approach to enhance fake speech detection models' generalization and information retention.

Findings

01. Achieves a 31% improvement over the baseline on the ASVspoof 2021 LA dataset.

02. Performs competitively on the ASVspoof 2021 DF dataset.

03. Introduces a novel combination of frequency-domain and time-domain data augmentation.

Abstract

In telephony scenarios, the fake speech detection (FSD) task, which aims to combat speech spoofing attacks, is challenging. Data augmentation (DA) methods are considered an effective means of addressing the FSD task in telephony scenarios and are typically applied at either the time-domain or the frequency-domain stage. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA method, Frequency-mix (Freqmix), and introduce Freqmix knowledge distillation (FKD) to enhance the model's information extraction and generalization abilities. Specifically, we use Freqmix-enhanced data as input for the teacher model, while the student model's input undergoes a time-domain DA method. We use a multi-level feature distillation approach to restore information and improve the model's generalization capabilities. Our approach achieves state-of-the-art results on ASVspoof 2021 LA…
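
The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what a multi-level feature distillation loss can look like. It assumes a mean-squared-error match between teacher and student features at each level, equal level weights by default, and a hypothetical mixing factor `alpha`; none of these choices, nor the function names, come from the paper.

```python
from typing import List, Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_level_distill_loss(
    teacher_feats: List[Sequence[float]],
    student_feats: List[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-level MSE terms (assumed form, not the paper's)."""
    if len(teacher_feats) != len(student_feats):
        raise ValueError("teacher and student must expose the same number of levels")
    if weights is None:
        weights = [1.0] * len(teacher_feats)  # assumption: equal level weights
    return sum(
        w * mse(t, s) for w, t, s in zip(weights, teacher_feats, student_feats)
    )


def total_loss(cls_loss: float, distill_loss: float, alpha: float = 0.5) -> float:
    """Student objective: classification loss plus a scaled distillation term.

    `alpha` is a hypothetical hyperparameter introduced for illustration.
    """
    return cls_loss + alpha * distill_loss


# Toy features from two levels. In the setting described above, the teacher
# would see Freqmix-augmented input and the student time-domain-augmented input.
teacher = [[1.0, 2.0], [0.0, 0.0, 0.0]]
student = [[1.0, 0.0], [0.0, 3.0, 0.0]]

distill = multi_level_distill_loss(teacher, student)
print(distill)                    # 5.0
print(total_loss(0.25, distill))  # 2.75
```

In a real training loop the feature lists would be intermediate activations taken from matching layers of the two networks, and the teacher's features would usually be treated as fixed targets.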

Taxonomy

Topics: Speech Recognition and Synthesis · Digital Media Forensic Detection