Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

Jiahui Geng; Thy Thy Tran; Preslav Nakov; Iryna Gurevych

arXiv:2506.00548 · cs.CR · June 3, 2025

TL;DR

This paper introduces Con Instruction, a method for generating adversarial images and audio that bypass the safety mechanisms of multimodal language models by exploiting the models' ability to interpret non-textual instructions.

Contribution

We propose a new attack method that generates non-textual adversarial inputs without training data, demonstrating significant safety bypasses in multiple multimodal models.

Findings

01. Achieves up to 86.6% attack success rate on LLaVA-v1.5

02. Effectively bypasses safety mechanisms in vision- and audio-language models

03. Uncovers performance gaps among existing defense techniques

Abstract

Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, we exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction. We optimize these adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental implications of MLLMs' sophisticated understanding. Unlike prior work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLM safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new Attack Response Categorization (ARC) framework, which evaluates both the quality of the…
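
The embedding-space alignment mentioned in the abstract can be illustrated with a deliberately abstract toy. This summary does not give the paper's loss, encoders, or optimizer, so everything below is an assumption made for illustration: a random linear map stands in for a modality encoder, squared Euclidean distance stands in for the alignment objective, and the function name `align_to_target` is hypothetical. The sketch only shows the general idea of adjusting an input by gradient descent until its embedding approaches a fixed target embedding.

```python
import numpy as np


def align_to_target(encoder, target, steps=500, seed=0):
    """Projected gradient descent on x to shrink ||encoder @ x - target||^2.

    `encoder` is a toy linear stand-in for a modality encoder (hypothetical);
    `target` is the embedding the input should land near.
    """
    rng = np.random.default_rng(seed)
    # Start from random noise in the valid input range [0, 1].
    x = rng.uniform(0.0, 1.0, size=encoder.shape[1])
    # Step size 1/L, where L = 2 * sigma_max^2 is the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(encoder, 2) ** 2)
    losses = []
    for _ in range(steps):
        residual = encoder @ x - target
        losses.append(float(residual @ residual))
        grad = 2.0 * encoder.T @ residual
        # Gradient step, then clip back into the valid range (the projection).
        x = np.clip(x - lr * grad, 0.0, 1.0)
    residual = encoder @ x - target
    losses.append(float(residual @ residual))
    return x, losses


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(8, 32))            # toy encoder: 32-dim input -> 8-dim embedding
    t = W @ rng.uniform(0.2, 0.8, size=32)  # a reachable target embedding
    x, losses = align_to_target(W, t)
    print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.6f}")
```

The clipping step keeps the optimized input inside a valid range, much as pixel values must stay within bounds; a real multimodal encoder is nonlinear, so this linear toy says nothing about how well such an optimization works in practice.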

Code & Models

Repositories

UKPLab/acl2025-con-instruction
PyTorch · Official

Videos

Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Artificial Intelligence in Law

Methods: ALIGN