SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Yi Zhu; Surya Koppisetti; Trang Tran; Gaurav Bharaj

arXiv:2407.18517·cs.SD·July 29, 2024·3 cites

TL;DR

This paper introduces SLIM, a novel audio deepfake detection model that leverages style-linguistics mismatch features learned through self-supervised pretraining, improving out-of-domain generalization and providing explainability.

Contribution

SLIM explicitly models style-linguistics mismatch using self-supervised learning, enhancing generalization and interpretability in audio deepfake detection.

Findings

1. Outperforms benchmark methods on out-of-domain datasets
2. Achieves competitive results on in-domain data
3. Provides explainable model decisions based on style-linguistics mismatch

Abstract

Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized by generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate it from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used to complement standard pretrained acoustic features (e.g., Wav2vec) to learn a classifier on the real and fake classes. When the feature encoders are frozen, SLIM outperforms…
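The two-stage idea in the abstract (a style-linguistics dependency feature combined with acoustic features as classifier input) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the mismatch is measured, so the cosine-distance score, the function names `cosine_mismatch` and `build_classifier_input`, and the assumption that style, linguistic, and acoustic embeddings arrive as fixed-size vectors from frozen encoders are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_mismatch(style, ling):
    """Toy mismatch score per utterance: 1 - cosine similarity.

    ASSUMPTION: cosine distance stands in for whatever dependency measure
    the paper actually learns. Inputs have shape (batch, dim).
    """
    num = np.sum(style * ling, axis=1)
    den = np.linalg.norm(style, axis=1) * np.linalg.norm(ling, axis=1) + 1e-8
    return 1.0 - num / den


def build_classifier_input(style, ling, acoustic):
    """Concatenate the mismatch score with acoustic features.

    ASSUMPTION: a downstream real/fake classifier would consume this
    (batch, 1 + acoustic_dim) matrix; the classifier itself is omitted.
    """
    mismatch = cosine_mismatch(style, ling)[:, None]
    return np.concatenate([mismatch, acoustic], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    style = rng.normal(size=(4, 16))     # hypothetical style embeddings
    ling = rng.normal(size=(4, 16))      # hypothetical linguistic embeddings
    acoustic = rng.normal(size=(4, 32))  # hypothetical acoustic features
    features = build_classifier_input(style, ling, acoustic)
    print(features.shape)
```

In this sketch, a score near 0 means the style and linguistic embeddings agree and a score near 2 means they point in opposite directions. Keeping the score as an explicit input column is one way such a model could expose the mismatch for inspection, which is the kind of explainability the abstract refers to.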

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing