Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform
Yichuan Zhang, Chengxin Li, Yujie Gu

TL;DR
Smark introduces a universal watermarking method for TTS diffusion models that embeds watermarks into low-frequency audio regions using DWT, ensuring high audio quality and robustness against removal.
Contribution
The paper presents a lightweight, model-agnostic watermarking scheme for TTS diffusion models utilizing DWT to embed watermarks without degrading audio quality.
Findings
Smark maintains high audio quality across various models.
Watermarks are robust against common attack scenarios.
The method achieves high accuracy in watermark extraction.
Abstract
Text-to-Speech (TTS) diffusion models generate high-quality speech, which raises challenges for protecting model intellectual property and for tracing speech to ensure legal use. Audio watermarking is a promising solution. However, because TTS diffusion models differ structurally, existing watermarking methods are often designed for a specific model and degrade audio quality, which limits their practical applicability. To address this dilemma, this paper proposes a universal watermarking scheme for TTS diffusion models, termed Smark. This is achieved by designing a lightweight watermark embedding framework that operates in the common reverse diffusion paradigm shared by all TTS diffusion models. To mitigate the impact on audio quality, Smark utilizes the discrete wavelet transform (DWT) to embed watermarks into the relatively stable low-frequency regions of the audio,…
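To make the idea of embedding in the low-frequency DWT band concrete, here is a minimal sketch in plain Python. It is not Smark's algorithm: the abstract says Smark embeds during reverse diffusion, whereas this example post-processes a finished waveform. The single-level Haar wavelet, the parity-quantization rule, the `step` size, and all function names are assumptions chosen for illustration only.

```python
import math

_S = 1 / math.sqrt(2)  # Haar normalization factor


def haar_dwt(signal):
    """Single-level Haar DWT. Assumes an even-length list of floats.

    Returns (approx, detail): the low-frequency and high-frequency bands.
    """
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleaves reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


def embed_bits(signal, bits, step=0.05):
    """Hide one bit per approximation coefficient (illustrative rule, not Smark's).

    Each chosen coefficient is moved to the nearest multiple of `step`
    whose integer index has the same parity as the bit.
    """
    approx, detail = haar_dwt(signal)
    for i, bit in enumerate(bits):
        q = 2 * round((approx[i] / step - bit) / 2) + bit
        approx[i] = q * step
    return haar_idwt(approx, detail)


def extract_bits(signal, n_bits, step=0.05):
    """Read bits back from the parity of the quantized approximation coefficients."""
    approx, _ = haar_dwt(signal)
    return [round(approx[i] / step) % 2 for i in range(n_bits)]
```

Calling `embed_bits` on an even-length waveform and then `extract_bits` with the same `step` should return the embedded bits when the audio is unmodified. A larger `step` tolerates more distortion before bits flip but changes the waveform more, which is the quality-versus-robustness trade-off the abstract alludes to.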
Taxonomy
Topics: Advanced Steganography and Watermarking Techniques · Digital Media Forensic Detection · Music and Audio Processing
