mAVE: A Watermark for Joint Audio-Visual Generation Models
Luyang Si, Leyi Pan, Lijie Wen

TL;DR
mAVE is a cryptographic watermarking framework for joint audio-visual generation models. It binds audio and video at the latent level to prevent Swap Attacks and strengthen copyright protection.
Contribution
It is the first watermarking method designed specifically for joint audio-visual architectures, cryptographically binding modalities without fine-tuning.
Findings
Guarantees performance-losslessness.
Provides exponential security against Swap Attacks.
Achieves over 99% binding integrity.
Abstract
As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques suffer from an architectural mismatch by treating modalities as decoupled entities, exposing a critical Binding Vulnerability. Adversaries exploit this via Swap Attacks by replacing authentic audio with malicious deepfakes while retaining the watermarked video. Because current detectors rely on independent verification, they incorrectly authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging their reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio…
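The Binding Vulnerability described in the abstract can be illustrated with a toy analogy. The sketch below is not the mAVE algorithm, which embeds its watermark in the model's latents. It uses ordinary HMAC tags as a stand-in for watermarks, and the key, byte strings, function names, and the decoupled detector's acceptance rule (attribution follows the video mark alone) are all assumptions made for illustration.

```python
import hashlib
import hmac

KEY = b"hypothetical-vendor-key"  # assumption: a secret held by the vendor


def mac(data: bytes) -> bytes:
    """Keyed tag standing in for a watermark signal (analogy only)."""
    return hmac.new(KEY, data, hashlib.sha256).digest()


def sign_decoupled(video: bytes, audio: bytes) -> tuple[bytes, bytes]:
    """Mark each modality on its own, with no link between the two tags."""
    return mac(b"video:" + video), mac(b"audio:" + audio)


def detect_decoupled(video: bytes, audio: bytes, tags: tuple[bytes, bytes]) -> bool:
    """Assumed rule: content is attributed to the vendor if the video tag verifies."""
    video_tag, _audio_tag = tags
    return hmac.compare_digest(video_tag, mac(b"video:" + video))


def sign_bound(video: bytes, audio: bytes) -> bytes:
    """One tag over both modalities; the length prefix keeps the encoding unambiguous."""
    return mac(len(video).to_bytes(8, "big") + video + audio)


def detect_bound(video: bytes, audio: bytes, tag: bytes) -> bool:
    """Verification succeeds only if video and audio match the signed pair."""
    return hmac.compare_digest(tag, sign_bound(video, audio))


video = b"authentic video frames"
audio = b"authentic audio samples"
fake_audio = b"deepfake audio samples"  # the attacker's replacement track

decoupled_tags = sign_decoupled(video, audio)
bound_tag = sign_bound(video, audio)

# Swap Attack: keep the marked video, replace the audio.
swap_passes_decoupled = detect_decoupled(video, fake_audio, decoupled_tags)
swap_passes_bound = detect_bound(video, fake_audio, bound_tag)

print("decoupled detector accepts swap:", swap_passes_decoupled)
print("bound detector accepts swap:", swap_passes_bound)
```

In this analogy the decoupled detector still attributes the swapped clip to the vendor because the video mark is intact, while the bound check rejects it because the tag covers both modalities at once.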
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Digital Media Forensic Detection
