Unsupervised Domain Adaptation for Audio Deepfake Detection with Modular Statistical Transformations
Urawee Thani, Gagandeep Singh, Priyanka Singh

TL;DR
This paper introduces a modular unsupervised domain adaptation pipeline for audio deepfake detection that enhances cross-domain generalization using statistical transformations and feature alignment techniques.
Contribution
The proposed approach combines statistical feature normalization, selection, and alignment methods with pre-trained embeddings to improve cross-domain audio deepfake detection without labeled target data.
Findings
Achieved 62.7-63.6% accuracy in cross-domain scenarios.
Feature selection and CORAL alignment significantly boost performance.
Complete pipeline improves accuracy by 10.7% over baseline.
Abstract
Audio deepfake detection systems trained on one dataset often fail when deployed on data from different sources due to distributional shifts in recording conditions, synthesis methods, and acoustic environments. We present a modular pipeline for unsupervised domain adaptation that combines pre-trained Wav2Vec 2.0 embeddings with statistical transformations to improve cross-domain generalization without requiring labeled target data. Our approach applies power transformation for feature normalization, ANOVA-based feature selection, joint PCA for domain-agnostic dimensionality reduction, and CORAL alignment to match source and target covariance structures before classification via logistic regression. We evaluate on two cross-domain transfer scenarios: ASVspoof 2019 LA to Fake-or-Real (FoR) and FoR to ASVspoof, achieving 62.7-63.6% accuracy with balanced performance across real and fake…
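The stages named in the abstract can be sketched with scikit-learn and NumPy. This is a minimal illustration, not the authors' implementation. It assumes the Wav2Vec 2.0 embeddings are already extracted into arrays `Xs` (source), `ys` (source labels), and `Xt` (unlabeled target). The function names, the values of `k`, `n_components`, and `eps`, and the choice to fit the power transform on source data only are all assumptions that the abstract does not specify.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PowerTransformer


def _sym_matrix_power(mat, power):
    """Raise a symmetric positive semi-definite matrix to a real power."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 1e-12, None)  # guard against tiny negative eigenvalues
    return (vecs * vals**power) @ vecs.T


def coral(Xs, Xt, eps=1.0):
    """CORAL: whiten source features, then re-color with target covariance."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    return Xs @ _sym_matrix_power(Cs, -0.5) @ _sym_matrix_power(Ct, 0.5)


def adapt_and_predict(Xs, ys, Xt, k=64, n_components=16):
    """Hypothetical end-to-end sketch; hyperparameters are illustrative."""
    # 1. Power transformation (Yeo-Johnson by default); assumption: fit on source.
    pt = PowerTransformer()
    Xs_n, Xt_n = pt.fit_transform(Xs), pt.transform(Xt)

    # 2. ANOVA F-test feature selection, using source labels only.
    sel = SelectKBest(f_classif, k=k).fit(Xs_n, ys)
    Xs_s, Xt_s = sel.transform(Xs_n), sel.transform(Xt_n)

    # 3. Joint PCA: fit on source and target together (no labels needed).
    pca = PCA(n_components=n_components).fit(np.vstack([Xs_s, Xt_s]))
    Xs_p, Xt_p = pca.transform(Xs_s), pca.transform(Xt_s)

    # 4. CORAL: match source second-order statistics to the target's.
    Xs_a = coral(Xs_p, Xt_p)

    # 5. Logistic regression trained on aligned source, applied to target.
    clf = LogisticRegression(max_iter=1000).fit(Xs_a, ys)
    return clf.predict(Xt_p)
```

Only steps 3 and 4 touch target data, and neither uses target labels, which is what keeps the sketch unsupervised with respect to the target domain.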
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis
