UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models

Sejoon Oh; Yiqiao Jin; Megha Sharma; Donghyun Kim; Eric Ma; Gaurav Verma; Srijan Kumar

arXiv:2411.01703·cs.CL·February 3, 2025

Open Access

TL;DR

UniGuard is a universal safety mechanism for multimodal large language models that effectively defends against jailbreak attacks while preserving their core understanding abilities.

Contribution

It introduces a novel multimodal guardrail that considers both unimodal and cross-modal harmful signals, enhancing safety across various models and attack strategies.

Findings

01. Effective in reducing harmful outputs across multiple models
02. Maintains vision-language understanding capabilities
03. Generalizes well to different attack types

Abstract

Multimodal large language models (MLLMs) have revolutionized vision-language understanding but remain vulnerable to multimodal jailbreak attacks, where adversarial inputs are meticulously crafted to elicit harmful or inappropriate responses. We propose UniGuard, a novel multimodal safety guardrail that jointly considers the unimodal and cross-modal harmful signals. UniGuard trains a multimodal guardrail to minimize the likelihood of generating harmful responses in a toxic corpus. The guardrail can be seamlessly applied to any input prompt during inference with minimal computational costs. Extensive experiments demonstrate the generalizability of UniGuard across multiple modalities, attack strategies, and multiple state-of-the-art MLLMs, including LLaVA, Gemini Pro, GPT-4o, MiniGPT-4, and InstructBLIP. Notably, this robust defense mechanism maintains the models' overall vision-language…
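The abstract describes training a guardrail to minimize the likelihood of harmful responses over a toxic corpus, then applying it to any input at inference time. As a rough illustration only, the sketch below optimizes a single shared additive perturbation with projected gradient descent. Everything in it is an assumption made for illustration and is not the authors' implementation: a toy linear scorer (`harmful_prob`) stands in for the MLLM, and the names `train_guardrail`, `epsilon`, `lr`, and `steps` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def harmful_prob(w, x, delta):
    """Toy stand-in for an MLLM: probability of a harmful response to x + delta."""
    z = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))
    return sigmoid(z)

def train_guardrail(w, corpus, epsilon=0.5, lr=0.1, steps=200):
    """Projected gradient descent on one shared additive guardrail `delta`.

    Minimizes the mean log-probability of a harmful response over the corpus
    while keeping every coordinate of delta inside [-epsilon, epsilon].
    """
    dim = len(w)
    delta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x in corpus:
            p = harmful_prob(w, x, delta)
            # d/d(delta_i) of log sigmoid(z) is (1 - p) * w_i
            for i in range(dim):
                grad[i] += (1.0 - p) * w[i] / len(corpus)
        # Gradient step, then projection back onto the epsilon box
        delta = [max(-epsilon, min(epsilon, d - lr * g))
                 for d, g in zip(delta, grad)]
    return delta

random.seed(0)
w = [1.0, -2.0, 0.5, 1.5]  # hypothetical scorer weights
corpus = [[random.uniform(0, 1) for _ in w] for _ in range(8)]  # hypothetical toxic inputs
epsilon = 0.5
delta = train_guardrail(w, corpus, epsilon=epsilon)

zero = [0.0] * len(w)
before = sum(harmful_prob(w, x, zero) for x in corpus) / len(corpus)
after = sum(harmful_prob(w, x, delta) for x in corpus) / len(corpus)
print(f"mean harmful probability: {before:.3f} -> {after:.3f}")
```

Once trained, the same `delta` is added to every incoming input, which is why a guardrail of this kind adds almost no cost at inference time. The bound `epsilon` is a design choice in this sketch: a small perturbation budget is one plausible way to limit the effect on benign inputs.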

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Adversarial Robustness in Machine Learning

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings