mAVE: A Watermark for Joint Audio-Visual Generation Models

Luyang Si; Leyi Pan; Lijie Wen

arXiv:2603.07090 · cs.CR · March 10, 2026

Open Access

TL;DR

mAVE is a cryptographic watermarking framework for joint audio-visual generation models. It binds audio and video at the latent level to defend against Swap Attacks and strengthen copyright protection.

Contribution

mAVE is presented as the first watermarking method designed specifically for joint audio-visual architectures. It cryptographically binds the two modalities without fine-tuning.

Findings

01. Guarantees performance-losslessness.

02. Provides exponential security against Swap Attacks.

03. Achieves over 99% binding integrity.

Abstract

As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques suffer from an architectural mismatch: they treat modalities as decoupled entities, which exposes a critical Binding Vulnerability. Adversaries exploit this through Swap Attacks, replacing the authentic audio with a malicious deepfake while retaining the watermarked video. Because current detectors rely on independent verification ($\mathrm{Video}_{wm} \lor \mathrm{Audio}_{wm}$), they incorrectly authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging its reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio…
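
The Binding Vulnerability described in the abstract can be illustrated with a toy sketch. This is not mAVE's algorithm, which the excerpt here does not specify. It only contrasts an independent OR check with a hypothetical bound check in which the audio payload must equal a keyed tag (HMAC-SHA256) derived from the video payload. The names `Clip`, `bind_tag`, `verify_independent`, and `verify_bound`, and the idea of decoded byte payloads, are all assumptions made for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clip:
    # Hypothetical decoded watermark payloads; None means "no watermark found".
    video_payload: Optional[bytes]
    audio_payload: Optional[bytes]


def bind_tag(key: bytes, video_payload: bytes) -> bytes:
    """Derive the audio payload from the video payload with a keyed hash (toy binding)."""
    return hmac.new(key, video_payload, hashlib.sha256).digest()


def verify_independent(clip: Clip) -> bool:
    """Independent verification: accept if either modality carries a watermark."""
    return clip.video_payload is not None or clip.audio_payload is not None


def verify_bound(key: bytes, clip: Clip) -> bool:
    """Bound verification: accept only if the audio payload matches the video payload."""
    if clip.video_payload is None or clip.audio_payload is None:
        return False
    expected = bind_tag(key, clip.video_payload)
    return hmac.compare_digest(expected, clip.audio_payload)


key = b"vendor-secret-key"          # assumed vendor key
video = b"video-watermark-bits"     # assumed decoded video payload

authentic = Clip(video, bind_tag(key, video))
swapped_plain = Clip(video, None)                          # audio replaced by an unwatermarked deepfake
swapped_cross = Clip(video, bind_tag(key, b"other-clip"))  # audio lifted from a different watermarked clip

for name, clip in [("authentic", authentic),
                   ("swapped_plain", swapped_plain),
                   ("swapped_cross", swapped_cross)]:
    print(name, verify_independent(clip), verify_bound(key, clip))
```

In this toy model the OR check accepts all three clips, including both swapped ones, while the bound check accepts only the authentic pair. A real detector works on noisy latent or signal-level watermarks rather than exact byte strings, so the exact-match comparison here is a simplification.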

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Digital Media Forensic Detection