Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection

Xiaopeng Wang; Ruibo Fu; Zhengqi Wen; Zhiyong Wang; Yuankun Xie; Yukun Liu; Jianhua Tao; Xuefei Liu; Yongwei Li; Xin Qi; Yi Lu; Shuchen Shi

arXiv:2406.03247·cs.SD·June 11, 2024

Open Access

TL;DR

This paper introduces GFL-FAD, a genuine-focused learning framework that uses a Mask AutoEncoder (MAE) and counterfactual reasoning to improve the generalization of fake audio detection, reporting an EER of 0.25% on ASVspoof2019 LA.

Contribution

The paper proposes GFL-FAD, a framework that models genuine audio features using a Counterfactual Reasoning Enhanced Representation (CRER) built on MAE-based audio reconstruction, improving robustness against unseen spoofing techniques.

Findings

01 Achieves an EER of 0.25% on ASVspoof2019 LA

02 Outperforms existing methods in generalization to new spoofing attacks

03 Introduces a genuine audio reconstruction loss to focus on genuine features
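The equal error rate (EER) quoted in the first finding is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how such a figure can be computed from detection scores, not the evaluation script used in the paper. The function name `compute_eer`, the threshold sweep over observed scores, and the convention that higher scores mean "more genuine" are all assumptions made for illustration.

```python
def compute_eer(genuine_scores, spoof_scores):
    """Approximate the equal error rate (hypothetical helper).

    Assumes higher scores mean the detector considers the audio more genuine.
    """
    # Sweep every observed score as a candidate decision threshold.
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    genuine = [0.9, 0.8, 0.7, 0.6]
    spoof = [0.1, 0.2, 0.3, 0.65]
    print(f"EER: {compute_eer(genuine, spoof):.2%}")
```

With only a handful of trials the estimate is coarse; standard evaluation tooling typically interpolates between thresholds, which this sketch does not attempt.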

Abstract

The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework, called GFL-FAD, that aims for highly generalized FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation (CRER) based on audio reconstruction using the Mask AutoEncoder (MAE) architecture to accurately model genuine audio features. To reduce the influence of spoofed audio during training, we introduce a genuine audio reconstruction loss, which keeps the focus on learning the features of genuine data. In addition, content-related bottleneck (BN) features are extracted from the MAE to supplement the knowledge of the original audio. These BN features are adaptively fused with CRER to further improve…
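The idea of a genuine audio reconstruction loss can be illustrated with a short sketch. The excerpt above does not give the loss formula, so everything here is an assumption: the function name `genuine_reconstruction_loss`, the choice of mean squared error, and the rule that spoofed utterances are simply skipped. It shows one plausible way to keep spoofed samples from influencing a reconstruction objective, not the authors' implementation.

```python
def genuine_reconstruction_loss(inputs, reconstructions, is_genuine):
    """Mean squared reconstruction error over genuine utterances only.

    inputs, reconstructions: lists of equal-length feature vectors (lists of floats).
    is_genuine: list of bools, True where the utterance is genuine.
    All names and the MSE choice are hypothetical, for illustration only.
    """
    per_utterance = []
    for x, x_hat, genuine in zip(inputs, reconstructions, is_genuine):
        if not genuine:
            continue  # spoofed utterances contribute nothing to this loss
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        per_utterance.append(mse)
    # Assumption: a batch with no genuine utterances yields zero loss.
    return sum(per_utterance) / len(per_utterance) if per_utterance else 0.0


if __name__ == "__main__":
    batch = [[1.0, 2.0], [0.0, 0.0]]
    recon = [[1.0, 4.0], [10.0, 10.0]]
    labels = [True, False]  # second utterance is spoofed and is ignored
    print(genuine_reconstruction_loss(batch, recon, labels))
```

In this toy batch the badly reconstructed spoofed utterance has no effect on the result, which is the behavior the abstract describes at a high level.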

Taxonomy

Topics: Digital Media Forensic Detection · Music and Audio Processing