DIN-CTS: Low-Complexity Depthwise-Inception Neural Network with Contrastive Training Strategy for Deepfake Speech Detection

Lam Pham; Dat Tran; Phat Lam; Florian Skopik; Alexander Schindler; Silvia Poletti; David Fischinger; Martin Boyer

arXiv:2502.20225·cs.SD·April 1, 2025

Open Access

TL;DR

This paper introduces a low-complexity neural network trained with a contrastive strategy for deepfake speech detection, reaching 4.6% EER on the ASVspoof 2019 LA benchmark and outperforming the single-system submissions to that challenge.

Contribution

A novel combination of a Depthwise-Inception Network with a contrastive training strategy for efficient and accurate deepfake speech detection.

Findings

01

Achieved 4.6% EER on the ASVspoof 2019 LA dataset.

02

Outperformed the single-system submissions in the ASVspoof 2019 LA challenge.

03

Operates with only 1.77 million parameters and 985 million FLOPs.
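
For readers unfamiliar with the EER metric in the first finding, the following is a minimal sketch of how an equal error rate can be computed from detection scores. It is not the paper's evaluation code: the function name `compute_eer`, the convention that a higher score means "more likely fake", and the simple threshold sweep are all assumptions made for illustration.

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate the equal error rate by sweeping thresholds.

    Assumptions (not from the paper): higher score = more likely fake,
    labels use 1 for fake and 0 for bonafide.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        pred_fake = scores >= t
        # Fraction of bonafide utterances wrongly flagged as fake.
        false_alarm = np.mean(pred_fake[labels == 0])
        # Fraction of fake utterances wrongly passed as bonafide.
        miss = np.mean(~pred_fake[labels == 1])
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            # Keep the operating point where both error rates are closest.
            best_gap, eer = gap, (false_alarm + miss) / 2
    return eer

# Toy scores for two bonafide and two fake utterances.
eer_separable = compute_eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
eer_overlap = compute_eer([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```

A lower EER means the two error types balance at a smaller value; a detector whose scores separate the classes perfectly reaches an EER of zero.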

Abstract

In this paper, we propose a deep neural network approach for deepfake speech detection (DSD) based on a low-complexity Depthwise-Inception Network (DIN) trained with a contrastive training strategy (CTS). In this framework, input audio recordings are first transformed into spectrograms using Short-Time Fourier Transform (STFT) and Linear Filter (LF), which are then used to train the DIN. Once trained, the DIN processes bonafide utterances to extract audio embeddings, which are used to construct a Gaussian distribution representing genuine speech. Deepfake detection is then performed by computing the distance between a test utterance and this distribution to determine whether the utterance is fake or bonafide. To evaluate our proposed systems, we conducted extensive experiments on the benchmark dataset of ASVspoof 2019 LA. The experimental results demonstrate the effectiveness of…
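
The detection step described in the abstract (fit a Gaussian to bonafide embeddings, then score a test utterance by its distance to that distribution) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which distance is used, so the Mahalanobis distance is assumed here, the embeddings are random stand-ins for DIN outputs, and the names `fit_gaussian` and `mahalanobis` are hypothetical.

```python
import numpy as np

def fit_gaussian(embeddings, eps=1e-6):
    """Estimate the mean and inverse covariance of bonafide embeddings (N x D)."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    # Small ridge term so the covariance is invertible (an assumption).
    cov = cov + eps * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    """Distance of one embedding from the bonafide Gaussian (assumed metric)."""
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Synthetic stand-ins for embeddings; a trained network would supply these.
rng = np.random.default_rng(0)
bonafide = rng.normal(0.0, 1.0, size=(500, 8))
mean, cov_inv = fit_gaussian(bonafide)

near = np.zeros(8)      # embedding close to the bonafide cluster
far = np.full(8, 5.0)   # embedding far from the bonafide cluster
d_near = mahalanobis(near, mean, cov_inv)
d_far = mahalanobis(far, mean, cov_inv)
```

In a setup like this, an utterance would be labeled fake when its distance exceeds a threshold chosen on held-out data; the threshold itself is not given in the text above.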

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Digital Media Forensic Detection