SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise

Rui Sang; Yuxuan Liu

arXiv:2511.16114 · cs.SD · November 21, 2025

TL;DR

SceneGuard mixes scene-consistent audible background noise into speech recordings to protect them against voice cloning through text-to-speech training. It reduces speaker similarity by 5.5% while preserving speech intelligibility (STOI = 0.986) and staying robust to common audio countermeasures.

Contribution

Proposes a training-time voice protection method that uses natural acoustic scenes as audible protective noise, which is more robust to audio preprocessing than existing imperceptible perturbation techniques.

Findings

01. Reduces speaker similarity by 5.5% with very high statistical significance (p < 10^{-15}, Cohen's d = 2.18).

02. Preserves speech intelligibility (STOI = 0.986, reported as 98.6%).

03. Remains robust against MP3 compression, spectral subtraction, lowpass filtering, and downsampling.
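
The first finding rests on two quantities: a speaker similarity score and an effect size. As a minimal sketch, assuming similarity is the cosine similarity between speaker embeddings and that the effect size uses the pooled-standard-deviation form of Cohen's d (the page does not say which variant or which embedding model the authors use), the two could be computed as follows. The function names are hypothetical.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: this is one common definition, not necessarily the one
    used in the paper.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd
```

In use, `group_a` would hold similarity scores for models trained on unprotected speech and `group_b` the scores for models trained on protected speech, so a positive d indicates that protection lowered similarity.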

Abstract

Voice cloning technology poses significant privacy threats by enabling unauthorized speech synthesis from limited audio samples. Existing defenses based on imperceptible adversarial perturbations are vulnerable to common audio preprocessing such as denoising and compression. We propose SceneGuard, a training-time voice protection method that applies scene-consistent audible background noise to speech recordings. Unlike imperceptible perturbations, SceneGuard leverages naturally occurring acoustic scenes (e.g., airport, street, park) to create protective noise that is contextually appropriate and robust to countermeasures. We evaluate SceneGuard on text-to-speech training attacks, demonstrating 5.5% speaker similarity degradation with extremely high statistical significance (p < 10^{-15}, Cohen's d = 2.18) while preserving 98.6% speech intelligibility (STOI = 0.986). Robustness…
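
The core operation described in the abstract, adding an acoustic scene recording to speech as audible background noise, can be illustrated with a simple mixing routine. This is a sketch under stated assumptions, not the authors' implementation: the fixed target signal-to-noise ratio, the looping of short scene clips, and the names `mix_at_snr` and `mean_power` are all hypothetical, and the abstract does not say how SceneGuard selects or scales the noise. Plain Python lists keep the example dependency-free.

```python
import math
import random


def mean_power(signal):
    """Average power (mean squared amplitude) of a sample sequence."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, scene_noise, snr_db):
    """Add scene noise to speech so the mixture has the requested SNR in dB."""
    if not speech or not scene_noise:
        raise ValueError("speech and scene_noise must be non-empty")
    # Loop the scene recording if it is shorter than the speech, then trim.
    repeats = -(-len(speech) // len(scene_noise))  # ceiling division
    noise = (list(scene_noise) * repeats)[:len(speech)]
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and scene_noise must not be silent")
    # Pick the gain so that speech_power / (gain**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals standing in for real recordings (one second at 16 kHz).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
scene = [rng.gauss(0.0, 1.0) for _ in range(4000)]
protected = mix_at_snr(speech, scene, snr_db=10.0)  # 10 dB is an arbitrary choice
```

A lower `snr_db` makes the scene louder relative to the voice. Any real system would have to trade that protection strength against intelligibility, which the paper measures with STOI.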

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Adversarial Robustness in Machine Learning