Multi-Granularity Adaptive Time-Frequency Attention Framework for Audio Deepfake Detection under Real-World Communication Degradations

Haohan Shi; Xiyu Shi; Safak Dogan; Tianjin Huang; Yunxiao Zhang

arXiv:2508.01467 · eess.AS · August 5, 2025

Open Access

TL;DR

This paper introduces an audio deepfake detection framework that uses multi-granularity adaptive attention to identify fake audio under real-world communication degradations such as packet loss and speech codec compression.
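
To make the packet-loss degradation concrete, here is a minimal sketch of how it could be simulated on a waveform. It is an illustration only and not the paper's evaluation protocol: the function name `simulate_packet_loss`, the 20 ms packet length, the independent random loss model, and zero-filling of lost packets are all assumptions introduced here.

```python
import numpy as np

def simulate_packet_loss(wave, sample_rate=16000, packet_ms=20, loss_rate=0.1, seed=0):
    """Zero out randomly chosen fixed-length packets (hypothetical loss model)."""
    rng = np.random.default_rng(seed)
    packet_len = int(sample_rate * packet_ms / 1000)   # samples per packet
    n_packets = int(np.ceil(len(wave) / packet_len))
    lost = rng.random(n_packets) < loss_rate           # independent loss per packet
    out = wave.copy()
    for i in np.flatnonzero(lost):
        # Assumption: a lost packet is replaced by silence (no concealment).
        out[i * packet_len:(i + 1) * packet_len] = 0.0
    return out, lost

# One second of a 440 Hz tone as a stand-in for speech.
t = np.arange(16000) / 16000
wave = np.sin(2 * np.pi * 440 * t)
degraded, lost = simulate_packet_loss(wave, loss_rate=0.2)
```

Real channels often lose packets in bursts and receivers usually apply loss concealment, so a sketch like this is closer to a worst-case toy model than to an actual communication link.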

Contribution

The paper presents the first unified framework with a novel multi-granularity adaptive attention architecture for robust audio deepfake detection in degraded communication environments.

Findings

01 Outperforms state-of-the-art methods across various communication degradations
02 Enhances feature separability between real and fake audio
03 Improves detection robustness under multiple real-world conditions

Abstract

The rise of highly convincing synthetic speech poses a growing threat to audio communications. Although existing Audio Deepfake Detection (ADD) methods have demonstrated good performance under clean conditions, their effectiveness drops significantly under degradations such as packet losses and speech codec compression in real-world communication environments. In this work, we propose the first unified framework for robust ADD under such degradations, which is designed to effectively accommodate multiple types of Time-Frequency (TF) representations. The core of our framework is a novel Multi-Granularity Adaptive Attention (MGAA) architecture, which employs a set of customizable multi-scale attention heads to capture both global and local receptive fields across varying TF granularities. A novel adaptive fusion mechanism subsequently adjusts and fuses these attention branches based on…
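
The idea of attention branches at several granularities followed by an adaptive fusion can be sketched roughly as follows. This is not the authors' MGAA implementation: the abstract is truncated before it states what the fusion depends on, so the gate used here (a softmax over mean-pooled branch outputs projected by a hypothetical vector `gate_w`) is purely an assumption, as are the function names, the window sizes, and the restriction to attention along the time axis.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_attention(x, window):
    """Single-head self-attention over frames; window=None gives a global field."""
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (T, T) frame similarities
    if window is not None:
        idx = np.arange(t)
        too_far = np.abs(idx[:, None] - idx[None, :]) > window
        scores = np.where(too_far, -np.inf, scores)    # restrict to a local field
    return softmax(scores, axis=-1) @ x                # (T, D)

def adaptive_fusion(branches, gate_w):
    """Weight branches by an input-dependent softmax gate (hypothetical rule)."""
    stacked = np.stack(branches)                       # (B, T, D)
    summary = stacked.mean(axis=1)                     # (B, D) pooled per branch
    weights = softmax(summary @ gate_w)                # (B,) one weight per branch
    fused = np.tensordot(weights, stacked, axes=1)     # (T, D) weighted sum
    return fused, weights

rng = np.random.default_rng(0)
spec = rng.standard_normal((50, 16))                   # stand-in TF features: 50 frames x 16 bins
windows = [2, 8, None]                                 # two local scales plus one global branch
branches = [windowed_attention(spec, w) for w in windows]
fused, weights = adaptive_fusion(branches, rng.standard_normal(16))
```

In a trained model the gate parameters would be learned rather than drawn at random; the random `gate_w` here only demonstrates the data flow from branches to a single fused representation.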

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis