Generalized Spoofing Detection Inspired from Audio Generation Artifacts
Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh

TL;DR
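This paper introduces a 2D DCT spectro-temporal feature, computed over the log-Mel spectrogram, for audio deepfake detection. By capturing generation artifacts in the frequency domain, it outperforms log-Mel spectrogram, CQCC, and MFCC features and surpasses the top single systems previously reported in the ASVspoof 2019 challenge.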
Contribution
The paper proposes a new 2D DCT feature for spoofing detection and combines it with a CNN, improving detection accuracy and generalization over previous methods.
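To make the feature concrete, here is a minimal sketch of a 2D DCT applied to a log-Mel spectrogram. It is an illustration under stated assumptions, not the authors' implementation: the function names `dct_matrix` and `dct2d`, the 40 x 100 input size, and the orthonormal DCT-II scaling are all choices made for this example, and a random array stands in for a real log-Mel spectrogram.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis as an n x n matrix (one basis vector per row)."""
    k = np.arange(n)[:, None]   # DCT coefficient index
    i = np.arange(n)[None, :]   # sample index
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] /= np.sqrt(2.0)  # extra scaling of the constant (k = 0) row
    return basis * np.sqrt(2.0 / n)

def dct2d(log_mel: np.ndarray) -> np.ndarray:
    """Apply a 2D DCT-II over a (n_mels, n_frames) array: first axis, then second axis."""
    n_mels, n_frames = log_mel.shape
    return dct_matrix(n_mels) @ log_mel @ dct_matrix(n_frames).T

# Hypothetical stand-in for a log-Mel spectrogram (assumption: 40 Mel bands, 100 frames).
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 100))
feature = dct2d(log_mel)  # same shape as the input; each entry is a spectro-temporal modulation coefficient
```

Low-index coefficients describe slow variation across the Mel and time axes, while higher indices describe faster, periodic variation, which is where repeated artifacts would be expected to show up. With SciPy available, `scipy.fft.dctn(log_mel, type=2, norm="ortho")` should compute the same transform.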
Findings
Achieved a 14% reduction in t-DCF score over previous top systems.
Demonstrated that the 2D DCT feature is more effective than log-Mel spectrogram, CQCC, and MFCC features.
Validated the model's generalization on external datasets.
Abstract
State-of-the-art methods for audio generation suffer from fingerprint artifacts and repeated inconsistencies across temporal and spectral domains. Such artifacts can be captured well by frequency-domain analysis of the spectrogram. We therefore propose a novel use of a long-range spectro-temporal modulation feature, the 2D DCT over the log-Mel spectrogram, for audio deepfake detection. We show that this feature captures such artifacts better than the log-Mel spectrogram, CQCC, and MFCC. Alongside this feature, we employ spectrum augmentation and feature normalization to reduce overfitting and bridge the gap between the training and test datasets. We developed a CNN-based baseline that achieved a t-DCF of 0.0849 and outperformed the top single systems previously reported in the ASVspoof 2019 challenge. Finally, by combining our baseline with our…
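The abstract mentions spectrum augmentation and feature normalization without giving details. The sketch below shows one common way such steps are done, assuming SpecAugment-style random frequency and time masking and per-utterance mean-variance normalization. Both choices, the function names `mask_spectrogram` and `normalize`, and the mask widths are assumptions for illustration; the paper's actual procedure may differ.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Return a copy of spec with one random frequency band and one random time span masked.

    Assumption: masked regions are filled with the spectrogram mean.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_freq, n_time = out.shape
    fill = out.mean()

    # Frequency mask: a band of random width (possibly zero) at a random position.
    f_width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Time mask: a span of random width (possibly zero) at a random position.
    t_width = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[:, t_start:t_start + t_width] = fill
    return out

def normalize(feat, eps=1e-8):
    """Per-utterance normalization to zero mean and unit variance (an assumed scheme)."""
    return (feat - feat.mean()) / (feat.std() + eps)

# Hypothetical input: a random array standing in for a (n_mels, n_frames) spectrogram.
rng = np.random.default_rng(1)
spec = rng.standard_normal((40, 100)) * 3.0 + 5.0
augmented = mask_spectrogram(spec, rng=rng)
normalized = normalize(augmented)
```

Masking would typically be applied only during training, while the same normalization is applied to training and test data so that both are on a comparable scale.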
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection
