BreathNet: Generalizable Audio Deepfake Detection via Breath-Cue-Guided Feature Refinement
Zhe Ye, Xiangui Kang, Jiayi He, Chengxin Chen, Wei Zhu, Kai Wu, Yin Yang, Jiwu Huang

TL;DR
BreathNet is an audio deepfake detection framework that combines fine-grained breath cues and spectral features with specialized loss functions to achieve state-of-the-art generalization across multiple benchmarks.
Contribution
The paper introduces BreathNet, integrating breath-related cues via BreathFiLM and a fusion of temporal and spectral features, along with a new set of feature losses for improved deepfake detection.
Findings
Achieves 1.99% average EER on four benchmarks.
Outperforms existing methods on the In-the-Wild dataset.
Attains 4.94% EER on the latest ASVspoof5 benchmark.
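The findings above are reported as equal error rate (EER), the operating point at which the false-acceptance rate on spoofed audio equals the false-rejection rate on bona fide audio. As a minimal sketch of how such a figure can be computed from detector scores, assuming higher scores mean "more likely bona fide" (the function name `compute_eer` and the simple threshold sweep are illustrative choices, not the paper's evaluation code):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate by sweeping over observed scores.

    Assumption: a higher score means the detector believes the audio is
    bona fide. Returns a fraction in [0, 1], not a percentage.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every distinct observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection: bona fide audio scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance: spoofed audio scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and
    # report their midpoint as the EER estimate.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)
```

With a finite score set the two error curves rarely cross exactly, so the midpoint at the closest threshold is only an approximation; standard evaluation toolkits may interpolate differently.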
Abstract
As deepfake audio becomes more realistic and diverse, developing generalizable countermeasure systems has become crucial. Existing detection methods primarily depend on XLS-R front-end features to improve generalization. Nonetheless, their performance remains limited, partly due to insufficient attention to fine-grained information, such as physiological cues or frequency-domain features. In this paper, we propose BreathNet, a novel audio deepfake detection framework that integrates fine-grained breath information to improve generalization. Specifically, we design BreathFiLM, a feature-wise linear modulation mechanism that selectively amplifies temporal representations based on the presence of breathing sounds. BreathFiLM is trained jointly with the XLS-R extractor, in turn encouraging the extractor to learn and encode breath-related cues into the temporal features. Then, we use the…
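The abstract describes BreathFiLM as a feature-wise linear modulation (FiLM) mechanism conditioned on the presence of breathing sounds. The excerpt does not say how the scale and shift are produced, so the following is only a generic FiLM sketch under stated assumptions: a per-frame breath probability `breath_prob` is available, and the hypothetical parameters `w_gamma` and `w_beta` map it linearly to a per-channel scale and shift. None of these names come from the paper.

```python
import numpy as np

def breath_film(features, breath_prob, w_gamma, w_beta):
    """Generic FiLM: scale and shift each frame's features by a condition.

    features:    (T, D) temporal features, e.g. from a speech front end.
    breath_prob: (T,) assumed per-frame breath probability in [0, 1].
    w_gamma:     (D,) hypothetical weights producing the scale offset.
    w_beta:      (D,) hypothetical weights producing the shift.
    """
    features = np.asarray(features, dtype=float)
    breath = np.asarray(breath_prob, dtype=float)[:, None]  # (T, 1)

    # Scale is 1 where no breath is detected, so those frames pass
    # through unchanged; frames with breath are amplified per channel.
    gamma = 1.0 + breath * np.asarray(w_gamma, dtype=float)  # (T, D)
    beta = breath * np.asarray(w_beta, dtype=float)          # (T, D)

    return gamma * features + beta
```

In a trainable model the scale and shift parameters would be learned jointly with the feature extractor, which is how the abstract says the extractor is encouraged to encode breath-related cues; this NumPy version shows only the forward modulation.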
Taxonomy
Topics: Music and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis
