MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
Yutong Jin, Qi Li, Lingshuang Liu, Jianbing Ni

TL;DR
MelShield is a novel audio watermarking framework that embeds identifiable signals into AI-generated speech in the Mel-spectrogram domain, enabling reliable attribution and copyright protection without retraining TTS models.
Contribution
It introduces a plug-and-play, keyed spread-spectrum watermarking method operating during speech synthesis, enhancing robustness and scalability for AI-generated audio attribution.
Findings
Achieves near-100% watermark extraction accuracy on distorted audio.
Maintains high perceptual audio quality after watermark embedding.
Does not require retraining of existing TTS vocoders.
Abstract
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modification or retraining of the underlying TTS generation vocoder, such as DiffWave and…
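To make the embedding idea in the abstract concrete, here is a minimal sketch of keyed spread-spectrum watermarking on a log-Mel spectrogram. It is an illustration under stated assumptions, not the authors' implementation: the function names (`embed`, `extract`), the fixed Mel band, the `strength` value, and the SHA-256 key derivation are all hypothetical, and detection here runs directly on the Mel array, whereas a real pipeline would recompute the Mel-spectrogram from the vocoded waveform.

```python
import hashlib
import numpy as np


def _keyed_rng(key: str) -> np.random.Generator:
    # Assumption: derive a PRNG seed from the secret key with SHA-256.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return np.random.default_rng(seed)


def _carriers(key: str, n_frames: int, n_bits: int, band: tuple):
    # For every time-frequency cell in the chosen Mel band, draw
    # (a) which payload bit the cell carries and (b) a +/-1 chip.
    rng = _keyed_rng(key)
    height = band[1] - band[0]
    owner = rng.integers(0, n_bits, size=(height, n_frames))
    chips = rng.choice([-1.0, 1.0], size=(height, n_frames))
    return owner, chips


def embed(log_mel, bits, key, strength=0.1, band=(20, 60)):
    # Add a low-amplitude keyed pattern to the selected band only.
    owner, chips = _carriers(key, log_mel.shape[1], len(bits), band)
    signs = 2.0 * np.asarray(bits)[owner] - 1.0  # map bit {0,1} to {-1,+1}
    marked = log_mel.copy()
    marked[band[0]:band[1]] += strength * signs * chips
    return marked


def extract(log_mel, n_bits, key, band=(20, 60)):
    # Blind detection: correlate the band with the keyed chips and
    # take the sign of the per-bit correlation sum.
    owner, chips = _carriers(key, log_mel.shape[1], n_bits, band)
    region = log_mel[band[0]:band[1]]
    corr = (region - region.mean()) * chips  # remove the DC offset first
    scores = np.array([corr[owner == b].sum() for b in range(n_bits)])
    return (scores > 0).astype(int)


if __name__ == "__main__":
    # Synthetic stand-in for a TTS model's intermediate log-Mel output.
    host = np.random.default_rng(0).normal(-4.0, 1.0, size=(80, 2000))
    payload = [1, 0, 1, 1, 0, 0, 1, 0]
    marked = embed(host, payload, key="demo-key")
    print(extract(marked, len(payload), key="demo-key"))
```

In this sketch each payload bit is spread over thousands of cells, so the per-cell perturbation can stay small while the correlation sum remains well above the host-signal noise floor. Detection with a different key yields chips that are uncorrelated with the embedded pattern, which is what makes the scheme keyed.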
