Lightweight Resolution-Aware Audio Deepfake Detection via Cross-Scale Attention and Consistency Learning
K. A. Shahriar

TL;DR
This paper introduces a lightweight, resolution-aware audio deepfake detection method that uses cross-scale attention and consistency learning to improve robustness across various challenging conditions and datasets.
Contribution
It presents a novel multi-resolution spectral modeling framework with cross-scale attention and consistency learning, outperforming existing methods in robustness and efficiency.
Findings
Achieves near-perfect detection on ASVspoof 2019 LA (EER 0.16%)
Maintains high robustness across multiple datasets and conditions
Requires only 159k parameters and less than 1 GFLOP per inference
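The headline metric above is the equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be computed from detector scores. It is not the paper's evaluation code; the function name `equal_error_rate` and the score convention (higher means more likely bona fide) are assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER from scores and binary labels.

    Assumed convention (not from the paper): higher score = more likely
    bona fide; labels are 1 for bona fide and 0 for spoof.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Candidate thresholds: every observed score plus both extremes.
    thresholds = np.concatenate(([-np.inf], np.unique(scores), [np.inf]))
    spoof = scores[labels == 0]
    bona = scores[labels == 1]
    # A trial is accepted as bona fide when score >= threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy example with one spoof trial scored above one bona fide trial.
eer = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

With a finite score set the two error curves rarely cross exactly, so this sketch averages them at the closest threshold; standard evaluation toolkits may interpolate differently and report slightly different values.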
Abstract
Audio deepfake detection has become increasingly challenging due to rapid advances in speech synthesis and voice conversion technologies, particularly under channel distortions, replay attacks, and real-world recording conditions. This paper proposes a resolution-aware audio deepfake detection framework that explicitly models and aligns multi-resolution spectral representations through cross-scale attention and consistency learning. Unlike conventional single-resolution or implicit feature-fusion approaches, the proposed method enforces agreement across complementary time-frequency scales. The proposed framework is evaluated on three representative benchmarks: ASVspoof 2019 (LA and PA), the Fake-or-Real (FoR) dataset, and the In-the-Wild Audio Deepfake dataset under a speaker-disjoint protocol. The method achieves near-perfect performance on ASVspoof LA (EER 0.16%), strong robustness…
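To make the three ingredients named in the abstract concrete, here is a rough NumPy sketch of multi-resolution spectral features, attention from one scale to another, and a loss that penalizes disagreement between scales. The abstract does not give the architecture or the loss, so everything below is an assumption chosen for illustration: the window sizes, the band pooling, plain scaled dot-product attention, the cosine-based loss, and all function names. It should not be read as the author's method.

```python
import numpy as np

def log_spectrogram(x, n_fft, hop):
    """Log-magnitude STFT with a Hann window; shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

def band_pool(spec, n_bands=16):
    """Average frequency bins into n_bands so every scale has the same width."""
    chunks = np.array_split(spec, n_bands, axis=1)
    return np.stack([c.mean(axis=1) for c in chunks], axis=1)

def cross_scale_attention(query_feats, context_feats):
    """Scaled dot-product attention: frames of one scale attend to another scale."""
    d = query_feats.shape[1]
    logits = query_feats @ context_feats.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ context_feats

def consistency_loss(embeddings):
    """Mean pairwise (1 - cosine similarity) between per-scale embeddings."""
    unit = [e / (np.linalg.norm(e) + 1e-8) for e in embeddings]
    pairs = [(i, j) for i in range(len(unit)) for j in range(i + 1, len(unit))]
    return float(np.mean([1.0 - unit[i] @ unit[j] for i, j in pairs]))

# Hypothetical input: one second of noise at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)

# Assumed (n_fft, hop) pairs: short windows favor time resolution,
# long windows favor frequency resolution.
scales = [(256, 128), (512, 256), (1024, 512)]
feats = [band_pool(log_spectrogram(signal, n_fft, hop)) for n_fft, hop in scales]

# Fine-scale frames attend to coarse-scale frames.
fused = cross_scale_attention(feats[0], feats[2])

# One time-averaged embedding per scale, then the agreement penalty.
embeddings = [f.mean(axis=0) for f in feats]
loss = consistency_loss(embeddings)
```

In a trained model the features, attention projections, and embeddings would be learned, and a term like `loss` would be added to the classification objective; here they are fixed transforms so the data flow between scales is easy to follow.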
Taxonomy
Topics: Digital Media Forensic Detection · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis
