Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
Zeliang Zhang, Susan Liang, Daiki Shimada, Chenliang Xu

TL;DR
This paper investigates the adversarial vulnerabilities of audio-visual models from temporal and modality perspectives, proposing new attacks and a robust training framework to improve resilience against such threats.
Contribution
The paper introduces two novel adversarial attacks, one exploiting temporal invariance across consecutive time segments and one inducing misalignment between the audio and visual modalities, along with a tailored adversarial training method for enhanced robustness.
Findings
The proposed attacks significantly degrade the performance of audio-visual models.
The proposed adversarial training improves robustness and efficiency.
The approach achieves state-of-the-art results on the Kinetics-Sounds dataset.
Abstract
While audio-visual learning equips models with a richer understanding of the real world by leveraging multiple sensory modalities, this integration also introduces new vulnerabilities to adversarial attacks. In this paper, we present a comprehensive study of the adversarial robustness of audio-visual models, considering both temporal and modality-specific vulnerabilities. We propose two powerful adversarial attacks: 1) a temporal invariance attack that exploits the inherent temporal redundancy across consecutive time segments and 2) a modality misalignment attack that introduces incongruence between the audio and visual modalities. These attacks are designed to thoroughly assess the robustness of audio-visual models against diverse threats. Furthermore, to defend against such attacks, we introduce a novel audio-visual adversarial training framework. This framework addresses key…
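The idea of a modality misalignment attack can be illustrated with a toy sketch. This is not the authors' formulation, which the abstract does not spell out. It assumes two hypothetical linear encoders (W_a for audio, W_v for video), flattened feature vectors as inputs, and uses the dot product of the two embeddings as a stand-in for audio-visual agreement. The attack takes alternating sign-gradient steps that reduce this agreement while keeping each perturbation inside an L-infinity ball of radius eps.

```python
# Toy sketch only: linear encoders and a dot-product agreement score are
# illustrative assumptions, not the method proposed in the paper.

def matvec(W, x):
    """Compute W @ x for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mat_t_vec(W, z):
    """Compute W^T @ z."""
    return [sum(W[i][j] * z[i] for i in range(len(W))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(v):
    return (v > 0) - (v < 0)

def agreement(W_a, W_v, x_a, x_v):
    """Dot product between the audio and visual embeddings."""
    return dot(matvec(W_a, x_a), matvec(W_v, x_v))

def step(x_adv, grad, x_clean, eps, alpha):
    """One sign-gradient descent step, clipped to the eps-ball around x_clean."""
    return [min(max(x - alpha * sign(g), x0 - eps), x0 + eps)
            for x, g, x0 in zip(x_adv, grad, x_clean)]

def misalign_attack(W_a, W_v, x_a, x_v, eps=0.1, alpha=0.02, steps=10):
    """Perturb both modalities so that their embeddings agree less."""
    adv_a, adv_v = list(x_a), list(x_v)
    for _ in range(steps):
        # Gradient of the agreement w.r.t. the audio input is W_a^T (W_v x_v).
        g_a = mat_t_vec(W_a, matvec(W_v, adv_v))
        adv_a = step(adv_a, g_a, x_a, eps, alpha)
        # Gradient w.r.t. the visual input, using the updated audio input.
        g_v = mat_t_vec(W_v, matvec(W_a, adv_a))
        adv_v = step(adv_v, g_v, x_v, eps, alpha)
    return adv_a, adv_v

# Hypothetical weights and inputs, chosen only for illustration.
W_a = [[1.0, 0.5], [-0.3, 0.8]]
W_v = [[0.6, -0.2], [0.4, 0.9]]
x_a = [0.5, -0.1]
x_v = [0.2, 0.7]
eps = 0.1

before = agreement(W_a, W_v, x_a, x_v)
adv_a, adv_v = misalign_attack(W_a, W_v, x_a, x_v, eps=eps)
after = agreement(W_a, W_v, adv_a, adv_v)
print("agreement before:", before, "after:", after)
```

The updates alternate between modalities because the agreement score is linear in each input when the other is held fixed, so every clipped sign step can only lower it or leave it unchanged. A real audio-visual model would replace the linear encoders with deep networks and obtain gradients by automatic differentiation.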
Taxonomy
Topics: Digital Media Forensic Detection
