Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Soumya Suvra Ghosal; Souradip Chakraborty; Vaibhav Singh; Tianrui Guan; Mengdi Wang; Alvaro Velasquez; Ahmad Beirami; Furong Huang; Dinesh Manocha; Amrit Singh Bedi

arXiv:2411.18688·cs.CR·June 17, 2025

TL;DR

This paper introduces Immune, an inference-time safety framework for multimodal large language models that significantly reduces jailbreak success rates without compromising model capabilities.

Contribution

The paper proposes a novel inference-time defense mechanism, Immune, that enhances safety against jailbreaks in multimodal LLMs, addressing limitations of training-time alignment.

Findings

01. Immune reduces the jailbreak attack success rate by over 57% on LLaVA-1.6.

02. It defends effectively across diverse jailbreak benchmarks.

03. It preserves the original capabilities of the models.

Abstract

With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks. In this work, we first highlight an important safety gap: alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks. Additionally, we provide a mathematical characterization of Immune, offering insights into why it improves safety against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while…
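The abstract describes defending at inference time by steering decoding with a safe reward model. The toy sketch below illustrates the general idea of reward-guided controlled decoding. It is not the authors' implementation: the scoring rule (base log-probability plus reward divided by a temperature `alpha`), the function names, the three-token vocabulary, and the keyword-based reward are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List

# Hypothetical stand-in for a learned safety reward model:
# penalize any continuation containing a flagged token.
UNSAFE_TOKENS = {"explosive"}


def toy_reward(tokens: List[str]) -> float:
    """Return 0.0 for safe continuations and -5.0 for flagged ones."""
    return -5.0 if any(t in UNSAFE_TOKENS for t in tokens) else 0.0


def toy_base_logprobs(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical base model: a fixed next-token distribution."""
    return {
        "explosive": math.log(0.6),
        "refuse": math.log(0.3),
        "the": math.log(0.1),
    }


def controlled_decode_step(
    logprobs: Dict[str, float],
    prefix: List[str],
    reward_fn: Callable[[List[str]], float],
    top_k: int = 3,
    alpha: float = 1.0,
) -> str:
    """Choose the next token among the base model's top-k candidates.

    Each candidate is scored as log p(token | prefix) + reward / alpha,
    so a smaller alpha gives the reward more influence.
    """
    candidates = sorted(logprobs, key=logprobs.get, reverse=True)[:top_k]
    scores = {
        tok: logprobs[tok] + reward_fn(prefix + [tok]) / alpha
        for tok in candidates
    }
    return max(scores, key=scores.get)


prefix = ["how", "to", "make"]
base = toy_base_logprobs(prefix)

greedy_choice = max(base, key=base.get)
guided_choice = controlled_decode_step(base, prefix, toy_reward, alpha=1.0)
weak_guidance_choice = controlled_decode_step(base, prefix, toy_reward, alpha=100.0)
```

In this toy setup, plain greedy decoding follows the base model's most likely token, while the guided step with `alpha=1.0` moves to a safe candidate. With a very large `alpha` the reward barely affects the score and the choice falls back to the base model's preference, which is the trade-off a temperature-like parameter controls in this kind of scheme.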

Code & Models

Repositories

itsvaibhav01/Immune
pytorch

Taxonomy

Topics: Imbalanced Data Classification Techniques · Anomaly Detection Techniques and Applications · Artificial Intelligence in Law

Methods: Balanced Selection