Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models

Jinming Wen; Xinyi Wu; Shuai Zhao; Yanhao Jia; Yuwen Li

arXiv:2506.11521·cs.CR·June 16, 2025

Open Access

TL;DR

This paper surveys security vulnerabilities in multimodal large language models (MLLMs), focusing on audio-visual attacks (adversarial, backdoor, and jailbreak methods), and discusses future research directions.

Contribution

The survey offers the first unified review of attack types on audio-visual models, highlighting recent trends and challenges in attack and defense strategies.

Findings

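01. MLLMs can be manipulated into producing malicious or harmful content through instructions or inputs alone, such as adversarial perturbations and malicious queries.

02. Existing surveys lack a unified review of attack types.

03. Future research needs to address emerging attack methods.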

Abstract

Multimodal large language models (MLLMs), which bridge the gap between audio-visual and natural language processing, achieve state-of-the-art performance on several audio-visual tasks. Despite the superior performance of MLLMs, the scarcity of high-quality audio-visual training data and computational resources necessitates the utilization of third-party data and open-source MLLMs, a trend that is increasingly observed in contemporary research. This prosperity masks significant security risks. Empirical studies demonstrate that the latest MLLMs can be manipulated to produce malicious or harmful content. This manipulation is facilitated exclusively through instructions or inputs, including adversarial perturbations and malevolent queries, effectively bypassing the internal security mechanisms embedded within the models. To gain a deeper comprehension of the inherent security…

Taxonomy

Topics: Digital and Cyber Forensics · Advanced Malware Detection Techniques · Subtitles and Audiovisual Media