Audio Deepfake Detection at the First Greeting: "Hi!"
Haohan Shi, Xiyu Shi, Safak Dogan, Tianjin Huang, Yunxiao Zhang

TL;DR
This paper introduces S-MGAA, a lightweight model for detecting audio deepfakes in very short speech segments, emphasizing robustness and efficiency for real-time communication scenarios.
Contribution
It proposes S-MGAA, an extension of Multi-Granularity Adaptive Time-Frequency Attention with two modules (PCEM and FCEM) tailored for short, degraded audio inputs, improving detection accuracy and efficiency.
Findings
S-MGAA outperforms nine state-of-the-art baselines.
It demonstrates robustness to communication degradations.
It offers a favorable efficiency-accuracy trade-off for real-time deployment.
Abstract
This paper focuses on audio deepfake detection under real-world communication degradations, with an emphasis on ultra-short inputs (0.5-2.0s), targeting the capability to detect synthetic speech at a conversation opening, e.g., when a scammer says "Hi." We propose Short-MGAA (S-MGAA), a novel lightweight extension of Multi-Granularity Adaptive Time-Frequency Attention, designed to enhance discriminative representation learning for short, degraded inputs subjected to communication processing and perturbations. S-MGAA integrates two tailored modules: a Pixel-Channel Enhanced Module (PCEM) that amplifies fine-grained time-frequency saliency, and a Frequency Compensation Enhanced Module (FCEM) that supplements limited temporal evidence via multi-scale frequency modeling and adaptive frequency-temporal interaction. Extensive experiments demonstrate that S-MGAA consistently surpasses nine…
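To give a sense of how little signal a 0.5-2.0 s input contains, here is a minimal sketch of cropping the opening of an utterance and turning it into a time-frequency representation. It is not the authors' pipeline: the 16 kHz sample rate, the 512-point FFT, the 160-sample hop, the intermediate durations (1.0 s and 1.5 s), and the helper names `crop_opening` and `log_spectrogram` are all assumptions made for illustration, and random noise stands in for real speech.

```python
import numpy as np


def crop_opening(waveform, sample_rate, duration_s):
    """Return the first duration_s seconds, zero-padded if the clip is shorter."""
    n = int(round(duration_s * sample_rate))
    out = np.zeros(n, dtype=np.float32)
    m = min(n, len(waveform))
    out[:m] = waveform[:m]
    return out


def log_spectrogram(segment, n_fft=512, hop=160):
    """Log-magnitude STFT with a Hann window; shape (n_fft // 2 + 1, frames)."""
    if len(segment) < n_fft:
        segment = np.pad(segment, (0, n_fft - len(segment)))
    n_frames = 1 + (len(segment) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [segment[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) on silent (zero-padded) frames.
    return np.log(magnitude + 1e-6).T


# Assumed setup: 16 kHz audio, with noise as a placeholder for a real recording.
sample_rate = 16000
rng = np.random.default_rng(0)
waveform = rng.standard_normal(3 * sample_rate).astype(np.float32)

# Durations span the 0.5-2.0 s range from the abstract; 1.0 and 1.5 are assumed.
segments = {d: crop_opening(waveform, sample_rate, d) for d in (0.5, 1.0, 1.5, 2.0)}
specs = {d: log_spectrogram(s) for d, s in segments.items()}

for d, spec in specs.items():
    print(f"{d:.1f} s -> {len(segments[d])} samples, spectrogram {spec.shape}")
```

Under these assumed settings a 0.5 s crop yields only a few dozen frames along the time axis while the frequency axis stays the same size, which is one way to read the abstract's motivation for compensating limited temporal evidence with frequency modeling.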
