SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection
Kyudan Jung, Jihwan Kim, Minwoo Lee, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park

TL;DR
SNAP is a speaker-nulling framework that suppresses speaker information in the encoder features used for speech deepfake detection, so that detectors focus on synthesis artifacts and generalize better to unseen speakers.
Contribution
This paper introduces SNAP, a speaker-nulling method that isolates synthesis artifacts by removing speaker-dependent features, enhancing deepfake detection robustness.
Findings
SNAP achieves state-of-the-art detection performance.
Reducing speaker entanglement improves generalization to unseen speakers.
Orthogonal projection effectively suppresses speaker-specific information.
Abstract
Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on…
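The abstract's core step, estimating a speaker subspace and projecting features onto its orthogonal complement, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the excerpt does not say how the subspace is estimated, so the sketch assumes it is spanned by the top singular vectors of the centered per-speaker mean embeddings. The function names `estimate_speaker_subspace` and `null_speaker`, the rank `k`, and the synthetic data are all hypothetical.

```python
import numpy as np

def estimate_speaker_subspace(embeddings, speaker_ids, k):
    """Return an orthonormal basis U (d x k) for an assumed speaker subspace.

    Assumption: speaker information is captured by the directions along which
    per-speaker mean embeddings vary; the paper may estimate it differently.
    """
    speakers = np.unique(speaker_ids)
    if k > len(speakers):
        raise ValueError("k cannot exceed the number of speakers")
    global_mean = embeddings.mean(axis=0)
    # One centered centroid per speaker, stacked into a (num_speakers x d) matrix.
    centroids = np.stack(
        [embeddings[speaker_ids == s].mean(axis=0) for s in speakers]
    ) - global_mean
    # Rows of vt are orthonormal directions ordered by singular value.
    _, _, vt = np.linalg.svd(centroids, full_matrices=False)
    return vt[:k].T

def null_speaker(embeddings, basis):
    """Apply P = I - U U^T to each row, removing the component inside span(U)."""
    return embeddings - (embeddings @ basis) @ basis.T

# Synthetic demo: 4 hypothetical speakers, 25 utterances each, 16-dim features.
rng = np.random.default_rng(0)
speaker_ids = np.repeat(np.arange(4), 25)
speaker_offsets = 5.0 * rng.normal(size=(4, 16))
features = speaker_offsets[speaker_ids] + 0.1 * rng.normal(size=(100, 16))

U = estimate_speaker_subspace(features, speaker_ids, k=3)
residual = null_speaker(features, U)

# The residual has no component left along the estimated speaker directions.
print("max |residual @ U|:", np.abs(residual @ U).max())
```

In this sketch a downstream detector would be trained on `residual` rather than `features`. The choice of `k` trades off how much speaker variation is removed against how much potentially useful information is discarded along with it.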
Taxonomy
Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
