Lightweight Resolution-Aware Audio Deepfake Detection via Cross-Scale Attention and Consistency Learning

K. A. Shahriar

arXiv:2601.06560 · eess.AS · January 13, 2026

Open Access

TL;DR

This paper introduces a lightweight, resolution-aware audio deepfake detection method that uses cross-scale attention and consistency learning to improve robustness across various challenging conditions and datasets.

Contribution

The paper presents a multi-resolution spectral modeling framework that combines cross-scale attention with consistency learning, and reports better robustness and efficiency than existing methods.

Findings

01. Achieves near-perfect detection on ASVspoof LA (EER 0.16%)

02. Maintains high robustness across multiple datasets and conditions

03. Requires only 159k parameters and less than 1 GFLOP per inference
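
The first finding is stated as an equal error rate (EER): the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The sketch below shows one simple way to estimate it from detector scores. It is an illustration only, not the benchmark's official scoring code; the function name `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for this example.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER by scanning every observed score as a threshold.

    Assumption: a higher score means the detector considers the
    utterance more likely to be bona fide (genuine).
    Returns (eer, threshold) at the point where the false acceptance
    and false rejection rates are closest to each other.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: genuine audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up for illustration, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.4]
eer, threshold = equal_error_rate(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On this scale, the reported EER of 0.16% means that at the balanced threshold roughly 16 in 10,000 trials of each class are misclassified.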

Abstract

Audio deepfake detection has become increasingly challenging due to rapid advances in speech synthesis and voice conversion technologies, particularly under channel distortions, replay attacks, and real-world recording conditions. This paper proposes a resolution-aware audio deepfake detection framework that explicitly models and aligns multi-resolution spectral representations through cross-scale attention and consistency learning. Unlike conventional single-resolution or implicit feature-fusion approaches, the proposed method enforces agreement across complementary time-frequency scales. The proposed framework is evaluated on three representative benchmarks: ASVspoof 2019 (LA and PA), the Fake-or-Real (FoR) dataset, and the In-the-Wild Audio Deepfake dataset under a speaker-disjoint protocol. The method achieves near-perfect performance on ASVspoof LA (EER 0.16%), strong robustness…
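
To make "complementary time-frequency scales" and "agreement across scales" concrete, the sketch below illustrates two ideas. The first is the standard trade-off in short-time spectral analysis: a longer analysis window gives finer frequency resolution but coarser time resolution. The second is a toy consistency penalty that is zero when the predictions from every scale agree. The abstract does not specify the paper's window sizes, sample rate, or loss, so everything here is a hypothetical stand-in: the names `stft_resolution` and `consistency_penalty`, the 16 kHz sample rate, the three window sizes, and the squared-difference penalty.

```python
import math


def stft_resolution(sample_rate, n_fft, hop):
    """Time and frequency resolution of one short-time spectral analysis scale."""
    return {
        "freq_res_hz": sample_rate / n_fft,       # spacing between frequency bins
        "window_ms": 1000 * n_fft / sample_rate,  # duration covered by one frame
        "hop_ms": 1000 * hop / sample_rate,       # step between successive frames
    }


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_penalty(per_scale_logits):
    """Mean pairwise squared distance between per-scale class probabilities.

    Hypothetical stand-in for a consistency loss: it is 0.0 when all
    scales predict the same distribution and grows as they disagree.
    """
    probs = [softmax(logits) for logits in per_scale_logits]
    total, pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            total += sum((a - b) ** 2 for a, b in zip(probs[i], probs[j]))
            pairs += 1
    return total / pairs if pairs else 0.0


# Assumed settings for illustration only: 16 kHz audio, three window sizes.
for n_fft in (256, 512, 1024):
    print(n_fft, stft_resolution(16000, n_fft, n_fft // 4))

# Toy [bona fide, spoof] logits from three scales; the third one disagrees.
print(consistency_penalty([[2.0, -1.0], [1.8, -0.9], [-0.5, 0.7]]))
```

A penalty of this kind would typically be added to the classification loss with a weighting factor, so that training favors models whose per-scale predictions agree; whether the paper does this in the same way cannot be determined from the abstract alone.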

Taxonomy

Topics: Digital Media Forensic Detection · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis