SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection
Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen, Dan Raviv

TL;DR
SONAR introduces a frequency-guided contrastive framework that disentangles and leverages high-frequency residuals in audio to improve deepfake detection, achieving state-of-the-art results and faster convergence.
Contribution
The paper proposes a novel spectral contrastive approach that explicitly disentangles low- and high-frequency audio features for enhanced deepfake detection.
Findings
Achieves state-of-the-art performance on ASVspoof 2021 and in-the-wild benchmarks.
Converges four times faster than strong baseline models.
The frequency-guided contrastive learning framework is architecture-agnostic.
Abstract
Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned copy of the same path, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views for long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise…
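Two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the filters are constrained or how the contrastive loss is assembled. The sketch assumes that "value-constrained" means filter taps that sum to zero (zero gain at DC) with bounded magnitude, and it shows only the standard Jensen-Shannon divergence that the loss is named after. The function names `constrain_high_pass`, `fir_filter`, and `js_divergence`, and the `limit` parameter, are hypothetical.

```python
import math


def constrain_high_pass(taps, limit=1.0):
    """Project raw filter taps onto a zero-sum kernel with bounded values.

    Assumption: zero-sum taps (zero gain at DC) stand in for the paper's
    value constraint; `limit` is a hypothetical bound on tap magnitude.
    """
    mean = sum(taps) / len(taps)
    centered = [t - mean for t in taps]  # taps now sum to zero
    peak = max(abs(t) for t in centered)
    scale = min(1.0, limit / peak) if peak > 0 else 1.0
    return [t * scale for t in centered]  # uniform scaling keeps the zero sum


def fir_filter(signal, taps):
    """Apply the kernel to a 1-D signal (valid mode, no padding)."""
    n = len(taps)
    return [
        sum(taps[k] * signal[i + k] for k in range(n))
        for i in range(len(signal) - n + 1)
    ]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    taps = constrain_high_pass([0.2, -1.0, 0.5])
    constant = [3.0] * 8  # pure DC content, the lowest possible frequency
    print(fir_filter(constant, taps))  # values at or near zero
    print(js_divergence([0.5, 0.5], [0.5, 0.5]))
    print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

A constant input is suppressed by the zero-sum kernel, which is the sense in which such a filter keeps only residual, higher-frequency content. The divergence is zero for identical distributions and reaches its maximum of ln 2 for distributions with disjoint support, so it is bounded and symmetric, unlike plain KL divergence.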
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music and Audio Processing
