Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection
Awais Khan, Khalid Mahmood Malik, Shah Nawaz

TL;DR
This paper introduces a spectra-temporal fusion method that uses novel coefficients and an auto-encoder to detect several kinds of voice spoofing attack (synthetic, replay, and deepfake) across multiple datasets.
Contribution
It proposes a unified spectra-temporal approach with new coefficients and an auto-encoder to improve spoofing detection robustness across attack types.
Findings
Effective against synthetic, replay, and deepfake attacks
Robust performance on multiple benchmark datasets
Addresses spectral and temporal spoofing artifacts
Abstract
Voice spoofing attacks pose a significant threat to automated speaker verification systems. Existing anti-spoofing methods often simulate specific attack types, such as synthetic or replay attacks. However, in real-world scenarios, the countermeasures are unaware of the generation schema of the attack, necessitating a unified solution. Current unified solutions struggle to detect spoofing artifacts, especially with recent spoofing mechanisms. For instance, the spoofing algorithms inject spectral or temporal anomalies, which are challenging to identify. To this end, we present a spectra-temporal fusion leveraging frame-level and utterance-level coefficients. We introduce a novel local spectral deviation coefficient (SDC) for frame-level inconsistencies and employ a bi-LSTM-based network for sequential temporal coefficients (STC), which capture utterance-level artifacts. Our…
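The frame-to-utterance idea in the abstract can be illustrated with a toy sketch. This is not the paper's SDC or STC: the abstract does not define either coefficient, so `local_deviation`, `utterance_summary`, and `fuse` are hypothetical names. The frame-level value here is simply each frame's mean absolute deviation from its neighbouring frames, and mean/standard-deviation statistics stand in for the bi-LSTM that the authors use at the utterance level.

```python
from statistics import mean, pstdev

def local_deviation(spec, radius=1):
    """Hypothetical frame-level feature: how far each frame of a
    spectrogram (list of frames, each a list of bin magnitudes)
    deviates from the average of its neighbouring frames."""
    n = len(spec)
    out = []
    for t in range(n):
        lo, hi = max(0, t - radius), min(n, t + radius + 1)
        neighbours = [spec[i] for i in range(lo, hi) if i != t]
        if not neighbours:
            out.append(0.0)  # a single-frame input has no neighbours
            continue
        # Per-bin average over the neighbouring frames.
        ref = [mean(col) for col in zip(*neighbours)]
        out.append(mean([abs(a - b) for a, b in zip(spec[t], ref)]))
    return out

def utterance_summary(frame_values):
    """Hypothetical utterance-level feature: simple statistics over the
    frame-level sequence (a stand-in for a learned sequence model)."""
    return [mean(frame_values), pstdev(frame_values)]

def fuse(spec, radius=1):
    """Return the frame-level sequence together with its utterance-level summary."""
    frame_feats = local_deviation(spec, radius)
    return frame_feats, utterance_summary(frame_feats)

if __name__ == "__main__":
    # Four frames, two bins; the third frame is an abrupt spectral jump.
    toy = [[1.0, 1.0], [1.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
    print(fuse(toy))
```

In this toy input the abrupt third frame produces large deviations at and around the jump, and the summary statistics carry that information up to the utterance level, which is the general intuition behind combining the two views.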
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing
