Adversarial Attacks in Multimodal Systems: A Practitioner's Survey

Shashank Kapoor; Sanjay Surendranath Girija; Lakshit Arora; Dipen Pradhan; Ankit Shetgaonkar; Aman Raj

arXiv:2505.03084·cs.LG·September 3, 2025

TL;DR

This paper provides the first comprehensive survey of adversarial attacks across all four modalities—text, image, video, and audio—in multimodal AI systems, highlighting vulnerabilities and evolving threats for practitioners.

Contribution

The paper offers a practitioner-focused overview of adversarial attack types in multimodal systems. Existing research covers individual attacks broadly but lacks a consolidated view of the threat landscape, and the survey fills that gap.

Findings

01 Identifies key attack types across modalities

02 Highlights the evolution of multimodal adversarial threats

03 Provides guidance for practitioners to mitigate risks

Abstract

The introduction of multimodal models is a huge step forward in Artificial Intelligence. A single model is trained to understand multiple modalities: text, image, video, and audio. Open-source multimodal models have made these breakthroughs more accessible. However, considering the vast landscape of adversarial attacks across these modalities, these models also inherit vulnerabilities of all the modalities, and ultimately, the adversarial threat amplifies. While broad research is available on possible attacks within or across these modalities, a practitioner-focused view that outlines attack types remains absent in the multimodal world. As more Machine Learning Practitioners adopt, fine-tune, and deploy open-source models in real-world applications, it's crucial that they can view the threat landscape and take the preventive actions necessary. This paper addresses the gap by surveying…
