SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

Ido Nitzan Hidekel; Gal Lifshitz; Khen Cohen; Dan Raviv

arXiv:2511.21325 · cs.SD · November 27, 2025

Open Access

TL;DR

SONAR introduces a frequency-guided contrastive framework that disentangles and leverages high-frequency residuals in audio to improve deepfake detection, achieving state-of-the-art results and faster convergence.

Contribution

The paper proposes a novel spectral contrastive approach that explicitly disentangles low- and high-frequency audio features for enhanced deepfake detection.

Findings

01. Achieves state-of-the-art performance on ASVspoof 2021 and in-the-wild benchmarks.

02. Converges four times faster than strong baseline models.

03. Unveils a frequency-guided contrastive learning framework that is architecture-agnostic.

Abstract

Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned copy of the same encoder, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views to model long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise…
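Two ingredients named in the abstract, a constrained high-pass filter and the Jensen-Shannon divergence, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the abstract is cut off before the loss is fully described, and it does not say how the filter values are constrained. The zero-sum constraint below is an assumption chosen because it gives zero response to constant (DC) input, and the function names `high_pass_residual` and `js_divergence` are hypothetical.

```python
import numpy as np


def high_pass_residual(signal, raw_kernel):
    """Filter a 1-D signal with a zero-sum kernel (assumed constraint).

    Subtracting the mean forces the taps to sum to zero, so any constant
    offset in the signal is removed and only variation around it remains.
    """
    kernel = np.asarray(raw_kernel, dtype=float)
    kernel = kernel - kernel.mean()
    return np.convolve(np.asarray(signal, dtype=float), kernel, mode="valid")


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize so each input sums to 1
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0 here.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# A constant signal carries no high-frequency content, so its residual vanishes.
flat = np.full(16, 3.0)
residual = high_pass_residual(flat, [0.2, -1.0, 0.5])

# Identical distributions have zero divergence; disjoint ones reach log(2).
same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The Jensen-Shannon divergence is symmetric and bounded by log 2, which makes it a stable choice for comparing two views; how SONAR builds its contrastive pairs from it is not recoverable from the truncated abstract.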

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music and Audio Processing