Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset
Faheem Ahmad, Ajan Ahmed, Masudul Imtiaz

TL;DR
This paper develops and evaluates simple, interpretable classical machine learning models using acoustic features to detect deepfake audio, providing a transparent baseline for future research.
Contribution
It introduces a classical machine learning baseline with detailed feature analysis and statistical validation for deepfake audio detection on the Fake-or-Real dataset.
Findings
An RBF SVM achieves ~93% accuracy and 7% EER (equal error rate).
Spectral features such as spectral centroid and bandwidth are key discriminative cues.
Linear models reach around 75% accuracy.
Abstract
Deep learning has enabled highly realistic synthetic speech, raising concerns about fraud, impersonation, and disinformation. Despite rapid progress in neural detectors, transparent baselines are needed to reveal which acoustic cues reliably separate real from synthetic speech. This paper presents an interpretable classical machine learning baseline for deepfake audio detection using the Fake-or-Real (FoR) dataset. We extract prosodic, voice-quality, and spectral features from two-second clips at 44.1 kHz (high-fidelity) and 16 kHz (telephone-quality) sampling rates. Statistical analysis (ANOVA, correlation heatmaps) identifies features that differ significantly between real and fake speech. We then train multiple classifiers -- Logistic Regression, LDA, QDA, Gaussian Naive Bayes, SVMs, and GMMs -- and evaluate performance using accuracy, ROC-AUC, EER, and DET curves. Pairwise McNemar's…
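Since EER is one of the headline metrics here, a minimal sketch of how it can be computed from detector scores may help. This is an illustration in plain Python, not the authors' evaluation code. The function name `equal_error_rate`, the convention that a higher score means "fake", and the toy scores are all assumptions made for the example.

```python
def equal_error_rate(scores, labels):
    """Approximate the equal error rate (EER) by sweeping thresholds.

    Assumed convention (not taken from the paper): labels are 1 for fake
    and 0 for real, and a clip is predicted fake when score >= threshold.
    Returns (eer, threshold) at the point where the false positive rate
    and false negative rate are closest.
    """
    fake = [s for s, y in zip(scores, labels) if y == 1]
    real = [s for s, y in zip(scores, labels) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one example of each class")

    best = None  # (gap, eer estimate, threshold)
    for t in sorted(set(scores)):
        fpr = sum(s >= t for s in real) / len(real)  # real clips flagged as fake
        fnr = sum(s < t for s in fake) / len(fake)   # fake clips missed
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, (fpr + fnr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
scores = [0.1, 0.2, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
eer, threshold = equal_error_rate(scores, labels)
print(f"EER = {eer:.2f} at threshold {threshold}")  # EER = 0.25 at threshold 0.6
```

On real data the two error rates rarely cross exactly at an observed score, so evaluation toolkits usually interpolate along the ROC or DET curve; the sweep above takes the nearest observed threshold instead.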
