TL;DR
This paper investigates the robustness of audio-visual classification models to adversarial noise, analyzing how fusion strategy, input features, and individual neural modules each affect vulnerability.
Contribution
It presents a comprehensive study of adversarial attacks on multimodal audio-visual models, with insights into how fusion strategy and feature choice affect robustness.
Findings
The choice of early, middle, or late fusion affects both robustness and accuracy
Frequency-domain and time-domain features contribute differently to robustness
Different neural modules show distinct vulnerabilities to adversarial noise
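To make the fusion terminology concrete, here is a minimal sketch of early versus late fusion using plain linear classifiers in NumPy. The feature sizes, weight names, and the choice of logit averaging for late fusion are all illustrative assumptions, not the architectures studied in the paper.

```python
import numpy as np

def early_fusion_logits(audio, video, W, b):
    """Early fusion: concatenate the modality features, then classify once."""
    fused = np.concatenate([audio, video], axis=-1)
    return fused @ W + b

def late_fusion_logits(audio, video, Wa, ba, Wv, bv):
    """Late fusion: classify each modality separately, then average the logits."""
    return 0.5 * ((audio @ Wa + ba) + (video @ Wv + bv))

# Hypothetical sizes: batch of 4 clips, 8 audio features, 16 video features, 3 classes.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
video = rng.normal(size=(4, 16))
n_classes = 3

W = rng.normal(size=(8 + 16, n_classes))
b = np.zeros(n_classes)
Wa = rng.normal(size=(8, n_classes))
ba = np.zeros(n_classes)
Wv = rng.normal(size=(16, n_classes))
bv = np.zeros(n_classes)

early = early_fusion_logits(audio, video, W, b)
late = late_fusion_logits(audio, video, Wa, ba, Wv, bv)
print(early.shape, late.shape)
```

In this toy late-fusion model, a perturbation applied only to the audio input can shift the output only through the audio branch, which is one reason the fusion point can matter for robustness.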
Abstract
As audio/visual classification models are widely deployed for sensitive tasks like content filtering at scale, it is critical to understand their robustness in addition to improving their accuracy. This work studies several key questions in multimodal learning through the lens of adversarial noise: 1) How does the choice of early, middle, or late fusion trade off robustness against accuracy? 2) How do different frequency/time domain features contribute to the robustness? 3) How do different neural modules contribute to the vulnerability to adversarial noise? In our experiment, we construct adversarial examples to attack state-of-the-art neural models trained on Google AudioSet. We compare how much attack potency, measured as the size of the adversarial perturbation under different norms, is needed to "deactivate" the victim model. Using adversarial noise to ablate multimodal models, we are…
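The idea of measuring attack potency as perturbation size under different norms can be illustrated on a toy linear binary classifier, where the smallest perturbation that crosses the decision boundary has a closed form. This is a sketch under stated assumptions, not the paper's attack: the function `min_perturbation`, the `mask` argument used to restrict the attack to one modality, and the 2-audio/4-video feature split are all hypothetical.

```python
import numpy as np

def min_perturbation(w, b, x, norm, mask=None, overshoot=1e-3):
    """Smallest perturbation (in `norm`) that flips the sign of f(x) = w.x + b.

    `mask` (0/1 vector) limits the attack to a subset of input features,
    e.g. only the audio features. Assumes the masked weights are not all zero.
    """
    w_m = w if mask is None else w * mask
    score = float(w @ x + b)
    sign = 1.0 if score >= 0 else -1.0
    if norm == "l2":
        # Distance to the hyperplane is |f(x)| / ||w||_2, along w.
        size = abs(score) / np.linalg.norm(w_m, 2)
        direction = w_m / np.linalg.norm(w_m, 2)
    elif norm == "linf":
        # The dual norm of L-infinity is L1: |f(x)| / ||w||_1, along sign(w).
        size = abs(score) / np.linalg.norm(w_m, 1)
        direction = np.sign(w_m)
    else:
        raise ValueError("norm must be 'l2' or 'linf'")
    # A small overshoot pushes the input just past the boundary.
    return -sign * size * (1.0 + overshoot) * direction

# Hypothetical input layout: first 2 features are audio, last 4 are video.
rng = np.random.default_rng(0)
w = rng.normal(size=6)
x = rng.normal(size=6)
b = 0.1
audio_mask = np.array([1, 1, 0, 0, 0, 0], dtype=float)
video_mask = 1.0 - audio_mask

for name, mask in [("audio", audio_mask), ("video", video_mask)]:
    d2 = min_perturbation(w, b, x, "l2", mask)
    dinf = min_perturbation(w, b, x, "linf", mask)
    print(name, "L2 size:", np.linalg.norm(d2), "Linf size:", np.max(np.abs(dinf)))
```

In this toy setting, the modality that needs the larger perturbation to flip the prediction is the more robust one. For deep models no closed form exists, so the perturbation would have to be found by an iterative attack instead.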
