DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection
Yassine El Kheir, Arnab Das, Yixuan Xiao, Xin Wang, Feidi Kallel, Enes Erdem Erdogan, Ngoc Thang Vu, Tim Polzehl, Sebastian Moeller

TL;DR
DeepFense is an open-source PyTorch toolkit for robust deepfake audio detection, enabling standardized evaluation, extensive benchmarking, and addressing biases in model performance.
Contribution
It introduces a comprehensive framework with diverse architectures and recipes, facilitating reproducibility and large-scale evaluation of deepfake audio detection models.
Findings
Carefully curated training data improves cross-domain generalization.
Pre-trained front-end feature extractors significantly influence performance.
Models exhibit biases related to audio quality, gender, and language.
Abstract
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
