MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

Yutong Jin; Qi Li; Lingshuang Liu; Jianbing Ni

arXiv:2605.01515 · cs.SD · May 5, 2026

TL;DR

MelShield is a novel audio watermarking framework that embeds identifiable signals into AI-generated speech in the Mel-spectrogram domain, enabling reliable attribution and copyright protection without retraining TTS models.

Contribution

MelShield introduces a plug-and-play, keyed spread-spectrum watermarking method that operates during speech synthesis, improving the robustness and scalability of attribution for AI-generated audio.

Findings

01. Achieves near-100% watermark extraction accuracy under distortion.

02. Maintains high perceptual audio quality after watermark embedding.

03. Does not require retraining of existing TTS vocoders.

Abstract

In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modification or retraining of the underlying TTS generation vocoder, such as DiffWave and…
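The core idea in the abstract, a keyed, low-energy spread-spectrum perturbation added to the Mel-spectrogram before vocoding, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the functions `embed`, `extract`, and `_carriers`, the `strength` parameter, the SHA-256 key-to-seed derivation, and the plain correlation detector. The sketch also spreads each bit over the whole spectrogram rather than over selected time-frequency regions, and it reads the payload back from the Mel array directly, skipping the vocoder and any re-analysis of the waveform.

```python
import hashlib
import random


def _carriers(key: bytes, n_bits: int, n_cells: int):
    """Derive one pseudorandom +/-1 chip sequence per payload bit from the key."""
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(n_cells)]
            for _ in range(n_bits)]


def embed(mel, bits, key: bytes, strength: float = 0.1):
    """Return a copy of a log-Mel spectrogram (list of rows) with the payload added.

    Each bit adds +strength or -strength times its keyed chip sequence,
    so no single cell moves by more than strength * len(bits).
    """
    n_cols = len(mel[0])
    carriers = _carriers(key, len(bits), len(mel) * n_cols)
    out = [row[:] for row in mel]
    for bit, chips in zip(bits, carriers):
        sign = 1.0 if bit else -1.0
        for i, chip in enumerate(chips):
            out[i // n_cols][i % n_cols] += strength * sign * chip
    return out


def extract(mel, n_bits: int, key: bytes):
    """Recover the payload by correlating the mean-removed Mel with each carrier."""
    flat = [v for row in mel for v in row]
    mean = sum(flat) / len(flat)
    carriers = _carriers(key, n_bits, len(flat))
    return [int(sum((v - mean) * c for v, c in zip(flat, chips)) > 0)
            for chips in carriers]


if __name__ == "__main__":
    # Synthetic stand-in for a log-Mel spectrogram: 80 bands x 200 frames.
    rng = random.Random(0)
    host = [[rng.gauss(-4.0, 1.0) for _ in range(200)] for _ in range(80)]
    payload = [1, 0, 1, 1, 0, 0, 1, 0]
    marked = embed(host, payload, b"secret-key")
    print(extract(marked, len(payload), b"secret-key") == payload)
```

Detection works because the host spectrogram is nearly uncorrelated with a keyed pseudorandom sequence, so the correlation sum grows with the number of cells for the embedded carrier but only with its square root for the host. A practical system would also need to survive the vocoder and subsequent distortions, which this sketch does not model.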
