SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection

Kyudan Jung; Jihwan Kim; Minwoo Lee; Soyoon Kim; Jeonghoon Kim; Jaegul Choo; Cheonbok Park

arXiv:2603.20686 · cs.SD · March 24, 2026

TL;DR

SNAP is a speaker-nulling framework that suppresses speaker information in the encoder features used for speech deepfake detection, so that detectors focus on synthesis artifacts and detect deepfakes more accurately for unseen speakers.

Contribution

This paper introduces SNAP, a speaker-nulling method that estimates a speaker subspace and projects it out of the encoder features, isolating synthesis artifacts in the residual and making deepfake detection more robust to unseen speakers.

Findings

01. SNAP achieves state-of-the-art detection performance.

02. Reducing speaker entanglement improves generalization to unseen speakers.

03. Orthogonal projection effectively suppresses speaker-specific information.

Abstract

Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on…
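The speaker-nulling step described in the abstract (estimate a speaker subspace, then project features onto its orthogonal complement) can be illustrated with a minimal NumPy sketch. The abstract does not say how the subspace is estimated, so everything below is an assumption made for illustration: the subspace is taken from an SVD of centered per-speaker mean features, the function names `estimate_speaker_subspace` and `null_speaker` are hypothetical, and the data is synthetic. This is not the authors' implementation.

```python
import numpy as np


def estimate_speaker_subspace(features, speaker_ids, rank):
    """Return an orthonormal basis (d x rank) for an assumed speaker subspace.

    Assumption: the subspace is spanned by the top right singular vectors of
    the centered per-speaker mean features. The paper may use another estimator.
    """
    speakers = np.unique(speaker_ids)
    means = np.stack([features[speaker_ids == s].mean(axis=0) for s in speakers])
    centered = means - means.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T


def null_speaker(features, basis):
    """Project features onto the orthogonal complement of span(basis).

    For an orthonormal basis B this computes x - B B^T x for every row x.
    """
    return features - (features @ basis) @ basis.T


# Synthetic stand-in for encoder features: 4 speakers, 25 utterances each, d = 8.
rng = np.random.default_rng(0)
speaker_ids = np.repeat(np.arange(4), 25)
offsets = 3.0 * rng.normal(size=(4, 8))          # speaker-dependent component
features = offsets[speaker_ids] + rng.normal(size=(100, 8))

# Centered means of 4 speakers span at most 3 directions, so rank=3 covers them.
basis = estimate_speaker_subspace(features, speaker_ids, rank=3)
residual = null_speaker(features, basis)
```

In this toy setup the residual features have no component along the estimated basis, and the per-speaker means of `residual` coincide, so a downstream classifier can no longer separate speakers by their mean offset. Whether real encoder features behave this cleanly depends on how well a low-rank linear subspace captures speaker information, which the sketch simply assumes.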

Taxonomy

Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing