SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise
Rui Sang, Yuxuan Liu

TL;DR
SceneGuard protects speech recordings against voice cloning by mixing in scene-consistent, audible background noise, so that text-to-speech models trained on the protected audio reproduce the speaker less faithfully. It reduces speaker similarity while preserving speech intelligibility and remaining robust to common audio countermeasures.
Contribution
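The paper proposes a training-time voice protection method that uses naturally occurring acoustic scenes (e.g., airport, street, park) as protective noise, targeting better robustness than existing imperceptible-perturbation defenses, which are vulnerable to denoising and compression.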
Findings
Achieves a 5.5% reduction in speaker similarity with high statistical significance (p < 10^{-15}, Cohen's d = 2.18).
Maintains 98.6% speech intelligibility (STOI = 0.986) with protection applied.
Robust against MP3 compression, spectral subtraction, lowpass filtering, and downsampling.
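The effect size quoted alongside the significance figure (Cohen's d, as given in the abstract) can be illustrated with a small sketch. This is not the authors' evaluation code: the pooled-standard-deviation form of Cohen's d is assumed here, the paper may use a paired variant, and the function name `cohens_d` and the score arrays are hypothetical placeholders rather than data from the paper.

```python
import numpy as np


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    a = np.asarray(group_a, dtype=np.float64)
    b = np.asarray(group_b, dtype=np.float64)
    n_a, n_b = len(a), len(b)
    # Pool the unbiased sample variances, weighted by degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Hypothetical speaker-similarity scores, made up for illustration only.
clean_scores = [0.82, 0.85, 0.80, 0.84, 0.83]
protected_scores = [0.78, 0.80, 0.76, 0.79, 0.77]
effect_size = cohens_d(clean_scores, protected_scores)
```

A positive value here means the clean-trained clones score higher in similarity than the protected-trained ones, which is the direction a protection method aims for.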
Abstract
Voice cloning technology poses significant privacy threats by enabling unauthorized speech synthesis from limited audio samples. Existing defenses based on imperceptible adversarial perturbations are vulnerable to common audio preprocessing such as denoising and compression. We propose SceneGuard, a training-time voice protection method that applies scene-consistent audible background noise to speech recordings. Unlike imperceptible perturbations, SceneGuard leverages naturally occurring acoustic scenes (e.g., airport, street, park) to create protective noise that is contextually appropriate and robust to countermeasures. We evaluate SceneGuard on text-to-speech training attacks, demonstrating 5.5% speaker similarity degradation with extremely high statistical significance (p < 10^{-15}, Cohen's d = 2.18) while preserving 98.6% speech intelligibility (STOI = 0.986). Robustness…
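The core operation described in the abstract, adding audible scene noise to a speech recording, can be sketched as mixing at a chosen signal-to-noise ratio. This is a minimal illustration under stated assumptions, not the SceneGuard implementation: the function name `mix_at_snr`, the `snr_db` parameter, and the tile-or-trim handling of the noise clip are all hypothetical, and the summary does not state how the method selects scenes or mixing levels.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add background noise to speech at a target SNR in dB (illustrative only)."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Assumption: loop the noise clip if it is shorter than the speech, then trim.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise clip is silent; cannot scale to a target SNR")

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise gain.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise
```

In this sketch a lower `snr_db` makes the scene noise louder relative to the speech, which would be expected to trade intelligibility for protection; the actual operating point used in the paper is not given in this summary.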
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Adversarial Robustness in Machine Learning
