Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives

Zeliang Zhang; Susan Liang; Daiki Shimada; Chenliang Xu

arXiv:2502.11858·cs.SD·March 4, 2025

Open Access

TL;DR

This paper investigates the adversarial vulnerabilities of audio-visual models from temporal and modality perspectives, proposing new attacks and a robust training framework to improve resilience against such threats.

Contribution

It introduces two novel adversarial attacks targeting temporal invariance and modality misalignment, along with a tailored adversarial training method for enhanced robustness.

Findings

01 Attacks significantly degrade model performance.

02 Proposed training improves robustness and efficiency.

03 Achieves state-of-the-art results on the Kinetics-Sounds dataset.

Abstract

While audio-visual learning equips models with a richer understanding of the real world by leveraging multiple sensory modalities, this integration also introduces new vulnerabilities to adversarial attacks. In this paper, we present a comprehensive study of the adversarial robustness of audio-visual models, considering both temporal and modality-specific vulnerabilities. We propose two powerful adversarial attacks: 1) a temporal invariance attack that exploits the inherent temporal redundancy across consecutive time segments and 2) a modality misalignment attack that introduces incongruence between the audio and visual modalities. These attacks are designed to thoroughly assess the robustness of audio-visual models against diverse threats. Furthermore, to defend against such attacks, we introduce a novel audio-visual adversarial training framework. This framework addresses key…
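
To make the two attack ideas in the abstract more concrete, here is a minimal toy sketch. It is not the authors' method: the abstract gives no formulas, so everything below is an assumption for illustration. The names `cosine_similarity`, `temporal_variation`, and `sign_step` are hypothetical, audio-visual congruence is approximated by the cosine similarity of two small feature vectors, temporal redundancy is approximated by how little consecutive segment features differ, and finite differences stand in for the backpropagated gradients a real attack would use.

```python
import math


def cosine_similarity(a, v):
    """Cosine similarity between an audio and a visual feature vector."""
    dot = sum(x * y for x, y in zip(a, v))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_a * norm_v + 1e-12)


def temporal_variation(segments):
    """Mean squared difference between consecutive segment features.

    Assumption: a low value is used here as a toy proxy for the
    temporal redundancy the abstract refers to.
    """
    diffs = [
        sum((p - q) ** 2 for p, q in zip(s0, s1))
        for s0, s1 in zip(segments, segments[1:])
    ]
    return sum(diffs) / len(diffs)


def sign_step(x, objective, eps, h=1e-4):
    """One L-infinity-bounded step that tries to decrease `objective`.

    Gradients are estimated by central finite differences; each
    coordinate then moves by exactly eps against the gradient sign
    (or stays put if the estimated gradient is zero).
    """
    stepped = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad_i = (objective(up) - objective(down)) / (2 * h)
        sign = (grad_i > 0) - (grad_i < 0)
        stepped.append(x[i] - eps * sign)
    return stepped


# Toy features (made up for illustration, not taken from the paper).
audio = [1.0, 0.5, -0.2]
visual = [0.9, 0.6, -0.1]

# Misalignment-style step: perturb the audio features so that they
# agree less with the visual features.
adv_audio = sign_step(audio, lambda a: cosine_similarity(a, visual), eps=0.1)

print(cosine_similarity(audio, visual))
print(cosine_similarity(adv_audio, visual))
```

In this toy setting the second printed similarity is lower than the first, while no audio feature moves by more than `eps`. A temporal variant could pass an objective built on `temporal_variation` to the same `sign_step` helper, though how the paper actually formulates either attack is not stated in the abstract.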

Taxonomy

Topics: Digital Media Forensic Detection