Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts

Yi Liu; Chengjun Cai; Xiaoli Zhang; Xingliang Yuan; Cong Wang

arXiv:2407.15050·cs.LG·July 23, 2024

Open Access

TL;DR

This paper introduces Arondight, a red teaming framework for evaluating the security of large vision language models (VLMs). The evaluation reveals that current VLMs can be induced to generate harmful content, and the paper proposes methods to improve safety assessments.

Contribution

We develop Arondight, an automated multi-modal red team framework for VLMs. It combines reinforcement-learning-guided prompt generation with diversity metrics to address gaps in existing security evaluation methods.
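The page does not say which diversity metrics the framework uses. As a rough illustration only, the sketch below computes distinct-n, a common lexical diversity score for a batch of generated texts, swapped in here as a stand-in. The function name `distinct_n` and the toy strings are hypothetical and are not taken from the paper.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams in `texts` that are unique.

    Illustrative stand-in for a diversity metric (an assumption, not the
    paper's definition). Returns a value in [0, 1]; higher means the batch
    repeats itself less.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        # Collect every contiguous window of n tokens from this text.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical toy batch: two texts that share one bigram.
toy_batch = ["a b c", "a b d"]
score = distinct_n(toy_batch, n=2)
```

A score like this could serve as one term in a reward that discourages a generator from producing near-duplicate test cases, though how Arondight actually weights diversity is not stated here.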

Findings

01. Achieved an 84.5% attack success rate on GPT-4 in toxic prompt scenarios.

02. Exposed significant vulnerabilities in current VLMs regarding harmful content generation.

03. Provided a categorized safety level assessment and reinforcement learning recommendations for VLMs.
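The first finding reports an attack success rate. A minimal sketch of how such a rate could be tallied is shown below, assuming it is the fraction of test cases a judge marks as successful, computed per scenario and then averaged across scenarios. The function name, the scenario labels, and the toy records are hypothetical, and the averaging scheme is an assumption rather than the paper's stated protocol.

```python
from collections import defaultdict


def attack_success_rate(results):
    """Compute per-scenario and scenario-averaged success rates.

    `results` is an iterable of (scenario, succeeded) pairs, where
    `succeeded` is a bool assigned by some external judge (assumed).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for scenario, succeeded in results:
        totals[scenario] += 1
        hits[scenario] += int(succeeded)
    if not totals:
        raise ValueError("no results to score")
    per_scenario = {s: hits[s] / totals[s] for s in totals}
    # Unweighted mean over scenarios (an assumption about the averaging).
    average = sum(per_scenario.values()) / len(per_scenario)
    return per_scenario, average


# Hypothetical toy records, not data from the paper.
toy_results = [
    ("scenario_a", True),
    ("scenario_a", False),
    ("scenario_b", True),
    ("scenario_b", True),
]
per_scenario, average = attack_success_rate(toy_results)
```

Averaging per scenario first keeps a scenario with many test cases from dominating the headline number. Pooling all cases before dividing would be an equally plausible convention.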

Abstract

Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs). Despite offering new possibilities for LLM applications, these advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content. While LLMs have undergone extensive security evaluations with the aid of red teaming frameworks, VLMs currently lack a well-developed one. To fill this gap, we introduce Arondight, a standardized red team framework tailored specifically for VLMs. Arondight is dedicated to resolving issues related to the absence of visual modality and inadequate diversity encountered when transitioning existing red teaming methodologies from LLMs to VLMs. Our framework features an automated multi-modal jailbreak attack, wherein visual jailbreak prompts are produced by a red team VLM, and textual prompts are…

Taxonomy

Topics: Digital and Cyber Forensics

Methods: Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections