DiffAnon: Diffusion-based Prosody Control for Voice Anonymization
Ismail Rasim Ulgen, Zexin Cai, Nicholas Andrews, Philipp Koehn, Berrak Sisman

TL;DR
DiffAnon is a diffusion-based voice anonymization method that gives explicit, continuous inference-time control over prosody preservation, so the privacy-utility trade-off can be adjusted within a single model.
Contribution
To the best of the authors' knowledge, DiffAnon is the first voice anonymization framework to offer structured, interpolatable inference-time control over prosody preservation using diffusion models.
Findings
Achieves strong utility while maintaining competitive privacy.
Enables smooth interpolation between anonymization strength and prosodic fidelity.
Provides structured trade-off behavior across different operating points.
Abstract
To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy trade-off, operating at fixed design points. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior, achieving…
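The abstract does not spell out how CFG produces a continuous control knob, so the following is a minimal sketch of the generic classifier-free guidance combination rule, not DiffAnon's actual implementation. It assumes that the dropped condition is the prosody signal and that the model yields one prediction with prosody conditioning and one without; the function name `cfg_combine`, the variable names, and the toy values are all hypothetical.

```python
from typing import List


def cfg_combine(
    eps_uncond: List[float],
    eps_cond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Generic classifier-free guidance mix of two denoiser predictions.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    values in between interpolate linearly, and values above 1 extrapolate.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins for denoiser outputs at one diffusion step (assumed values).
eps_without_prosody = [0.0, 1.0, -2.0]  # condition dropped (assumption: prosody)
eps_with_prosody = [1.0, 3.0, 0.0]      # condition supplied

for scale in (0.0, 0.5, 1.0):
    print(scale, cfg_combine(eps_without_prosody, eps_with_prosody, scale))
```

Under these assumptions, sweeping the scale from 0 toward 1 moves the output from the prediction that ignores the conditioning signal toward the one that follows it, which is the kind of single-model interpolation the abstract describes. How DiffAnon maps this scale onto anonymization strength is not stated in the text above.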
