Erasing Your Voice Before It's Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech
Myungjin Lee, Eunji Shin, Jiyoung Lee

TL;DR
TruS is a training-free framework for speaker unlearning in zero-shot TTS, enabling suppression of specific voices during inference without retraining, thus enhancing privacy and safety.
Contribution
The paper introduces an inference-time control method for speaker unlearning that requires no retraining and also applies to speakers not seen during training.
Findings
Effectively prevents voice synthesis of targeted speakers.
Works on both seen and unseen speakers.
Maintains other speech attributes like prosody and emotion.
Abstract
Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious crime risks, as they can synthesize voices of individuals who never consented. In this context, speaker unlearning aims to prevent the generation of specific speaker identities upon request. Existing approaches, reliant on retraining, are costly and limited to speakers seen in the training set. We present TruS, a training-free speaker unlearning framework that shifts the paradigm from data deletion to inference-time control. TruS steers identity-specific hidden activations to suppress target speakers while preserving other attributes (e.g., prosody and emotion). Experimental results show that TruS effectively prevents voice generation on both seen and unseen opt-out speakers, establishing a scalable safeguard for speech synthesis. The demo and code are available on…
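The abstract does not say how the steering is computed. As a rough illustration of what "steering identity-specific hidden activations" could look like in general, the sketch below removes the component of each hidden vector along an estimated speaker direction. This is a generic activation-steering toy, not the authors' implementation: the function names (`identity_direction`, `steer`), the mean-difference direction estimate, the `strength` parameter, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def identity_direction(target_acts, reference_acts):
    """Unit vector pointing from the mean reference activation to the mean
    target-speaker activation (hypothetical mean-difference estimate)."""
    diff = target_acts.mean(axis=0) - reference_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, strength=1.0):
    """Subtract `strength` times the component of each hidden vector that
    lies along `direction`; strength=1.0 is a full orthogonal projection."""
    coeff = hidden @ direction  # one scalar per frame
    return hidden - strength * np.outer(coeff, direction)


# Synthetic stand-in for hidden activations: the "target speaker" is
# modelled as a fixed offset added along one axis (an assumption).
rng = np.random.default_rng(0)
dim = 16
speaker_offset = np.zeros(dim)
speaker_offset[0] = 3.0

reference_acts = rng.normal(size=(200, dim))
target_acts = rng.normal(size=(200, dim)) + speaker_offset

direction = identity_direction(target_acts, reference_acts)

# Frames produced when the opt-out speaker is requested.
hidden = rng.normal(size=(5, dim)) + speaker_offset
steered = steer(hidden, direction)
```

With `strength=1.0` the steered frames have no remaining component along the estimated direction, while everything orthogonal to it is left unchanged. In this toy setting that orthogonal part stands in for attributes such as prosody and emotion, which the abstract says are preserved.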
Taxonomy
Topics: Speech Recognition and Synthesis · Hate Speech and Cyberbullying Detection · Mental Health via Writing
