Multi-Granularity Adaptive Time-Frequency Attention Framework for Audio Deepfake Detection under Real-World Communication Degradations
Haohan Shi, Xiyu Shi, Safak Dogan, Tianjin Huang, Yunxiao Zhang

TL;DR
This paper introduces a robust audio deepfake detection framework that uses multi-granularity adaptive attention to identify fake audio under real-world communication degradations such as packet loss and speech codec compression.
Contribution
The paper presents the first unified framework with a novel multi-granularity adaptive attention architecture for robust audio deepfake detection in degraded communication environments.
Findings
Outperforms state-of-the-art methods across various communication degradations
Enhances feature separability between real and fake audio
Improves detection robustness under multiple real-world conditions
Abstract
The rise of highly convincing synthetic speech poses a growing threat to audio communications. Although existing Audio Deepfake Detection (ADD) methods have demonstrated good performance under clean conditions, their effectiveness drops significantly under degradations such as packet losses and speech codec compression in real-world communication environments. In this work, we propose the first unified framework for robust ADD under such degradations, which is designed to effectively accommodate multiple types of Time-Frequency (TF) representations. The core of our framework is a novel Multi-Granularity Adaptive Attention (MGAA) architecture, which employs a set of customizable multi-scale attention heads to capture both global and local receptive fields across varying TF granularities. A novel adaptive fusion mechanism subsequently adjusts and fuses these attention branches based on…
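The idea of attention branches with different receptive fields, followed by a weighted fusion, can be illustrated with a minimal pure-Python sketch. It is not the authors' MGAA implementation: the function names (`windowed_attention`, `fuse_branches`), the `radius` parameter, the toy input, and the use of unprojected dot-product attention are all assumptions made for illustration. Because the abstract is truncated before it states what the fusion is conditioned on, the gate scores are passed in as a plain argument rather than computed.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_attention(frames, radius=None):
    """Dot-product self-attention over a sequence of feature vectors.

    Frame i attends only to frames j with |i - j| <= radius.
    radius=None gives a global receptive field (hypothetical convention).
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    out = []
    for i, query in enumerate(frames):
        lo = 0 if radius is None else max(0, i - radius)
        hi = len(frames) if radius is None else min(len(frames), i + radius + 1)
        neighbours = frames[lo:hi]
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in neighbours]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, neighbours))
                    for d in range(dim)])
    return out


def fuse_branches(branches, gate_scores):
    """Weighted sum of branch outputs, with softmax-normalised gate scores.

    How the gate scores are derived is an open assumption here; they are
    supplied by the caller.
    """
    gates = softmax(gate_scores)
    n_frames, dim = len(branches[0]), len(branches[0][0])
    return [[sum(g * b[t][d] for g, b in zip(gates, branches))
             for d in range(dim)]
            for t in range(n_frames)]


# Toy input: 4 time frames with 2 features each (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
radii = [1, 2, None]  # two local branches and one global branch
branches = [windowed_attention(frames, r) for r in radii]
fused = fuse_branches(branches, gate_scores=[0.0, 0.0, 0.0])
```

With equal gate scores the fused output is simply the average of the branches; unequal scores shift the weight toward finer or coarser receptive fields.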
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis
