HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection

Zhili Nicholas Liang; Soyeon Caren Han; Qizhou Wang; Christopher Leckie

arXiv:2602.01032 · cs.SD · February 3, 2026

Open Access

TL;DR

HierCon introduces a hierarchical attention framework with contrastive learning to improve detection of audio deepfakes by modeling dependencies across layers and time, achieving state-of-the-art results and better cross-domain generalization.

Contribution

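The paper presents HierCon, a hierarchical layer attention model trained with margin-based contrastive learning. Rather than weighting each layer of a self-supervised model independently, it captures dependencies across temporal frames, neighbouring layers, and layer groups for more effective audio deepfake detection.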
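The three attention levels named in the abstract (temporal frames, neighbouring layers, layer groups) can be illustrated with a small pooling sketch. This is not the authors' architecture: the dot-product scoring, the fixed query vectors `q_time`, `q_layer`, `q_group`, and the contiguous grouping by `group_size` are all assumptions made here for illustration, and a real model would learn its attention parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, query):
    """Pool equal-length vectors into one vector with softmax weights.

    Each vector is scored by its dot product with `query`, a hypothetical
    stand-in for a learned attention parameter.
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

def hierarchical_pool(features, group_size, q_time, q_layer, q_group):
    """features[layer][frame] is a feature vector.

    Returns (utterance_embedding, group_attention_weights).
    """
    # Stage 1: attend over temporal frames within each layer.
    layer_vecs = [attention_pool(frames, q_time)[0] for frames in features]
    # Stage 2: attend over neighbouring layers inside each contiguous group
    # (the grouping rule is an assumption of this sketch).
    groups = [layer_vecs[i:i + group_size]
              for i in range(0, len(layer_vecs), group_size)]
    group_vecs = [attention_pool(g, q_layer)[0] for g in groups]
    # Stage 3: attend over the layer groups.
    return attention_pool(group_vecs, q_group)

# Toy input: 4 layers, 2 frames per layer, 2-dimensional features.
features = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [4.0, 4.0]],
]
embedding, group_weights = hierarchical_pool(
    features, group_size=2,
    q_time=[1.0, 0.0], q_layer=[0.0, 1.0], q_group=[0.5, 0.5],
)
print(embedding, group_weights)
```

Pooling in stages means a frame only competes with other frames of its own layer, and a layer only with its neighbours, which is one simple way to contrast hierarchical modelling with a single flat set of independent layer weights.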

Findings

01. Achieves a state-of-the-art equal error rate (EER) of 1.93% on the ASVspoof 2021 DF dataset.

02. Generalizes across domains and recording conditions, reaching 6.87% EER on the In-the-Wild dataset.

03. Hierarchical modeling improves over independent layer weighting by 36.6% on ASVspoof 2021 DF and 22.5% on In-the-Wild.
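The headline numbers above are equal error rates (EER): the operating point at which the rate of spoofed audio accepted as genuine equals the rate of genuine audio rejected. The sketch below shows one simple way to estimate it from detector scores. The function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the threshold sweep over observed scores are assumptions of this example, not the official ASVspoof scoring tool.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    A trial is accepted as bona fide when its score is >= threshold.
    Returns the mean of the two error rates at the threshold where
    they are closest to each other.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("need at least one score of each class")
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores: one spoof trial (0.7) outranks two bona fide trials.
bonafide = [0.9, 0.6, 0.4, 0.8]
spoof = [0.1, 0.5, 0.3, 0.7]
print(equal_error_rate(bonafide, spoof))  # 0.25
```

On large evaluation sets the two error curves are usually interpolated rather than matched at a discrete threshold, so published EER values may differ slightly from this approximation.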

Abstract

Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention…
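The abstract mentions margin-based contrastive learning but this summary does not give the loss itself. As a hedged illustration, the sketch below uses the classic pairwise contrastive loss: same-label embeddings are pulled together, and different-label embeddings are pushed apart until they are at least `margin` away. The function names, the Euclidean distance, and the default `margin=1.0` are assumptions of this example and may differ from the paper's formulation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def margin_contrastive_loss(embeddings, labels, margin=1.0):
    """Mean pairwise contrastive loss over all distinct pairs in a batch.

    Same-label pairs contribute d^2 (pull together).
    Different-label pairs contribute max(0, margin - d)^2 (push apart
    only while they are closer than the margin).
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings to form a pair")
    total, n_pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = euclidean(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += d ** 2
            else:
                total += max(0.0, margin - d) ** 2
            n_pairs += 1
    return total / n_pairs

# Toy batch: label 1 = bona fide, label 0 = spoof.
batch = [[0.0, 0.0], [0.1, 0.0], [0.3, 0.4]]
labels = [1, 1, 0]
print(margin_contrastive_loss(batch, labels, margin=1.0))
```

Because the loss depends only on the bona fide or spoof label and not on the recording source, bona fide clips from different domains are drawn toward each other, which is one plausible reading of how such an objective could encourage domain-invariant embeddings.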

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis · Digital Media Forensic Detection