HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection
Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang, Christopher Leckie

TL;DR
HierCon introduces a hierarchical attention framework with contrastive learning to improve detection of audio deepfakes by modeling dependencies across layers and time, achieving state-of-the-art results and better cross-domain generalization.
Contribution
The paper presents HierCon, a novel hierarchical layer attention model with contrastive learning that captures dependencies across layers and time for more effective audio deepfake detection.
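The idea of attending over time, then over neighbouring layers, then over layer groups can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' architecture: the single query vector per level, the fixed `group_size`, and the names `attend` and `hierarchical_pool` are all assumptions introduced here.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x, query):
    """Attention-pool x over its second-to-last axis.

    x: (..., N, D), query: (D,)  ->  pooled (..., D), weights (..., N)
    """
    scores = x @ query / np.sqrt(x.shape[-1])   # (..., N)
    weights = softmax(scores, axis=-1)          # (..., N), sums to 1 over N
    pooled = (weights[..., None] * x).sum(axis=-2)
    return pooled, weights

def hierarchical_pool(hidden, q_time, q_layer, q_group, group_size):
    """hidden: (L, T, D) stack of per-layer frame features (hypothetical input).

    Level 1 pools frames within each layer, level 2 pools neighbouring
    layers within each group, level 3 pools the groups.
    """
    L, T, D = hidden.shape
    assert L % group_size == 0, "layers must split evenly into groups"
    per_layer, _ = attend(hidden, q_time)                    # (L, D)
    grouped = per_layer.reshape(L // group_size, group_size, D)
    per_group, _ = attend(grouped, q_layer)                  # (G, D)
    utterance, group_weights = attend(per_group, q_group)    # (D,), (G,)
    return utterance, group_weights
```

With 24 layers and `group_size=4`, the function returns one D-dimensional utterance embedding plus six group weights, which is the kind of quantity one could inspect to see which layer groups the model relies on.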
Findings
Achieves a state-of-the-art equal error rate (EER) of 1.93% on the ASVspoof 2021 DF dataset.
Reaches 6.87% EER on the In-the-Wild dataset, indicating robustness across domains and recording conditions.
Hierarchical modeling improves over independent layer weighting by 36.6% on ASVspoof 2021 DF and 22.5% on In-the-Wild.
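For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. A rough sketch of how it can be computed from detector scores follows. The function name `compute_eer` and the convention that higher scores mean "bona fide" are assumptions here, and official ASVspoof scoring tools may interpolate between thresholds differently.

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate equal error rate.

    scores: higher means more likely bona fide (assumed convention).
    labels: 1 for bona fide, 0 for spoof.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bona = scores[labels == 1]
    spoof = scores[labels == 0]
    thresholds = np.unique(scores)  # sorted candidate thresholds
    # False rejection: bona fide scored below the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])
    # False acceptance: spoof scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```

A perfectly separating detector gives an EER of 0, and lower values are better.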
Abstract
Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention…
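The abstract does not spell out the exact form of the margin-based contrastive objective, so the sketch below substitutes the classic pairwise contrastive loss as an illustration: embeddings with the same label (both bona fide or both spoof) are pulled together, while embeddings with different labels are pushed at least `margin` apart. The function name, the Euclidean distance, and the default margin are assumptions, not details taken from the paper.

```python
import numpy as np

def margin_contrastive_loss(emb, labels, margin=1.0):
    """Classic pairwise contrastive loss (illustrative stand-in).

    emb: (N, D) embeddings, labels: (N,) class ids, N >= 2.
    """
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (N, N).
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    # Keep each unordered pair once (upper triangle, no diagonal).
    iu = np.triu_indices(len(labels), k=1)
    d, s = dist[iu], same[iu]
    pull = d ** 2                              # same-class pairs: shrink distance
    push = np.maximum(0.0, margin - d) ** 2    # cross-class pairs: enforce margin
    return np.where(s, pull, push).mean()
```

Because the loss depends only on labels and distances, samples from different recording conditions that share a label are drawn toward the same region of the embedding space, which is one plausible route to the domain-invariant embeddings the abstract mentions.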
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis · Digital Media Forensic Detection
