TL;DR
SyncBreaker is a stage-aware multimodal adversarial attack framework that jointly perturbs audio and image inputs to effectively disrupt audio-driven talking head generation while maintaining perceptual quality.
Contribution
It introduces a multimodal protection method with stage-aware perturbations: nullifying supervision with Multi-Interval Sampling (MIS) for the image stream and Cross-Attention Fooling (CAF) for the audio stream. The combined attack outperforms single-modality baselines.
Findings
SyncBreaker degrades lip sync and facial dynamics more effectively than single-modality baselines.
It preserves input perceptual quality.
It remains robust under purification.
Abstract
Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses…
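The joint, budget-constrained update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sample_timesteps`, `project_linf`, `joint_step`, `toy_grad_fn`), the L-infinity budgets, the step sizes, and the toy gradient are all assumptions chosen for illustration. A real attack would obtain gradients by backpropagating a loss through the talking-head diffusion model; here a placeholder gradient stands in for it so the example runs on its own. The sketch shows two ideas only: drawing one timestep from each of several denoising intervals and averaging their gradients, and projecting the image and audio perturbations onto separate budgets.

```python
import numpy as np

def sample_timesteps(num_steps, num_intervals, rng):
    """Draw one timestep from each of `num_intervals` equal slices of [0, num_steps)."""
    edges = np.linspace(0, num_steps, num_intervals + 1).astype(int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(edges[:-1], edges[1:])]

def project_linf(x_adv, x_clean, eps):
    """Keep the perturbation inside an L-infinity ball of radius eps around x_clean."""
    return x_clean + np.clip(x_adv - x_clean, -eps, eps)

def joint_step(img_adv, aud_adv, img, aud, grad_fn, timesteps,
               eps_img, eps_aud, lr_img, lr_aud):
    """One signed-gradient descent step on both modalities, each with its own budget."""
    g_img = np.zeros_like(img)
    g_aud = np.zeros_like(aud)
    for t in timesteps:  # aggregate guidance from several denoising intervals
        gi, ga = grad_fn(img_adv, aud_adv, t)
        g_img += gi
        g_aud += ga
    g_img /= len(timesteps)
    g_aud /= len(timesteps)
    img_adv = project_linf(img_adv - lr_img * np.sign(g_img), img, eps_img)
    aud_adv = project_linf(aud_adv - lr_aud * np.sign(g_aud), aud, eps_aud)
    # Keep both signals in their valid ranges (assumed: pixels in [0, 1], waveform in [-1, 1]).
    return np.clip(img_adv, 0.0, 1.0), np.clip(aud_adv, -1.0, 1.0)

def toy_grad_fn(img_adv, aud_adv, t):
    """Placeholder gradient (hypothetical): a timestep-weighted constant, not a model gradient."""
    w = (t + 1) / 1000.0
    return w * np.ones_like(img_adv), w * np.ones_like(aud_adv)

rng = np.random.default_rng(0)
img = rng.uniform(0.2, 0.8, size=(8, 8, 3))   # stand-in portrait
aud = rng.uniform(-0.5, 0.5, size=1600)       # stand-in waveform
img_adv, aud_adv = img.copy(), aud.copy()
for _ in range(10):
    ts = sample_timesteps(1000, 4, rng)
    img_adv, aud_adv = joint_step(img_adv, aud_adv, img, aud, toy_grad_fn, ts,
                                  eps_img=8 / 255, eps_aud=0.002,
                                  lr_img=2 / 255, lr_aud=0.0005)
```

Separate budgets reflect that a perturbation size tolerable in pixel space is not directly comparable to one in waveform space; the specific values above are placeholders, not figures from the paper.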
