Attacks on multimodal models

Viacheslav Iablochnikov; Alexander Rogachev

arXiv:2412.01725·cs.CV·December 3, 2024

TL;DR

This paper investigates vulnerabilities in multimodal models, focusing on attacks against pre-trained components such as CLIP, to assess the security risks these components introduce in practical applications and how well such attacks generalize.

Contribution

It provides a comprehensive analysis of attack methods on multimodal models, emphasizing vulnerabilities inherited from open-source components like CLIP.

Findings

01  Vulnerabilities in CLIP-based image encoders are exploitable through patch attacks.

02  Multimodal models show varying robustness depending on attack type and component used.

03  Open-source pre-trained models can inherit and amplify security risks.

Abstract

Models that work with several modalities simultaneously in a chat format are becoming increasingly popular. These models are nonetheless open to potential attacks, especially since many of them include open-source components. It is important to study whether the vulnerabilities of these components are inherited and how dangerous this can be when such models are used in industry. This work studies various types of attacks on such models and evaluates their generalization capabilities. Modern vision-language models (VLMs) such as LLaVA and BLIP often reuse pre-trained parts from other models, so the main part of this research focuses on those parts, specifically the CLIP architecture, its image encoder (CLIP-ViT), and several variations of patch attacks against it.
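To make the idea of a patch attack on an image encoder concrete, here is a minimal sketch in plain Python. It is not the paper's method or code. The encoder is a random linear map standing in for CLIP-ViT, the objective (lowering cosine similarity to the clean embedding) is an assumed one, and all names (`embed`, `patch_attack`, `demo`) and hyperparameters are hypothetical. Only the pixels inside the patch are modified, using signed gradient steps clipped to the valid pixel range.

```python
import math
import random


def embed(weights, image):
    """Toy linear 'encoder' (assumption: stands in for CLIP-ViT)."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_grad_wrt_image(weights, image, target):
    """Analytic gradient of cos(embed(image), target) per pixel."""
    z = embed(weights, image)
    nz = math.sqrt(sum(v * v for v in z))
    nt = math.sqrt(sum(v * v for v in target))
    dot = sum(a * b for a, b in zip(z, target))
    # d cos / d z_j = t_j / (|z||t|) - (z.t) z_j / (|z|^3 |t|)
    dz = [t / (nz * nt) - dot * v / (nz ** 3 * nt) for v, t in zip(z, target)]
    # Chain rule through the linear map: d cos / d x_i = sum_j W[j][i] * dz[j]
    return [sum(weights[j][i] * dz[j] for j in range(len(dz)))
            for i in range(len(image))]


def patch_attack(weights, clean, patch_idx, steps=50, lr=0.02, seed=0):
    """Push the patched image's embedding away from the clean embedding."""
    rng = random.Random(seed)
    target = embed(weights, clean)
    adv = list(clean)
    for i in patch_idx:
        # Random start: the gradient vanishes when adv equals clean.
        adv[i] = rng.random()
    init_cos = cosine(embed(weights, adv), target)
    best, best_cos = list(adv), init_cos
    for _ in range(steps):
        grad = cosine_grad_wrt_image(weights, adv, target)
        for i in patch_idx:  # pixels outside the patch are never touched
            step = lr if grad[i] > 0 else -lr if grad[i] < 0 else 0.0
            adv[i] = min(1.0, max(0.0, adv[i] - step))  # keep pixels in [0, 1]
        cos = cosine(embed(weights, adv), target)
        if cos < best_cos:  # keep the most damaging patch seen so far
            best, best_cos = list(adv), cos
    return best, init_cos, best_cos


def demo(side=8, dim=8, patch=3, seed=0):
    """Build a random grayscale 'image' and attack its top-left corner."""
    rng = random.Random(seed)
    clean = [rng.random() for _ in range(side * side)]
    weights = [[rng.gauss(0.0, 1.0) for _ in range(side * side)]
               for _ in range(dim)]
    patch_idx = [r * side + c for r in range(patch) for c in range(patch)]
    adv, init_cos, best_cos = patch_attack(weights, clean, patch_idx)
    return clean, adv, patch_idx, init_cos, best_cos
```

Calling `demo()` returns the clean image, the patched image, the patch indices, and the cosine similarity before and after optimization. Against a real encoder, the analytic gradient would be replaced by automatic differentiation, and the objective would depend on the downstream task (for example retrieval or captioning).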

Code & Models

Repositories

slava-qw/image-retrieval-robustness (Official)

Taxonomy

Topics: Multi-Agent Systems and Negotiation

Methods: Contrastive Language-Image Pre-training · BLIP: Bootstrapping Language-Image Pre-training