HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States

Yilei Jiang; Xinyan Gao; Tianshuo Peng; Yingshui Tan; Xiaoyong Zhu; Bo Zheng; Xiangyu Yue

arXiv:2502.14744 · cs.CL · June 24, 2025

TL;DR

HiddenDetect is a novel, tuning-free framework that detects jailbreak attacks on large vision-language models by monitoring internal activation patterns, offering an efficient and scalable safety enhancement without extensive fine-tuning.

Contribution

We introduce HiddenDetect, a new method leveraging internal model activations to detect unsafe prompts in LVLMs without additional training.

Findings

01. Outperforms state-of-the-art jailbreak detection methods
02. Effectively identifies unsafe prompts via activation pattern analysis
03. Provides a scalable solution for LVLM safety enhancement

Abstract

The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses…
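The abstract does not say how the activation patterns are scored, so the following is only an illustrative sketch of the general idea of activation-based detection, not the authors' method. It assumes (hypothetically) that one hidden-state vector per layer is available for a prompt and that some reference direction associated with unsafe inputs has already been obtained. The function names, the mean-cosine aggregation, and the threshold are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def safety_score(layer_states, reference_direction, layers=None):
    """Mean cosine similarity between selected layers' hidden states and a
    reference direction. Both the aggregation and the direction are
    hypothetical choices for illustration."""
    if layers is None:
        layers = range(len(layer_states))
    sims = [cosine(layer_states[i], reference_direction) for i in layers]
    return sum(sims) / len(sims)


def is_flagged(layer_states, reference_direction, threshold=0.5, layers=None):
    """Flag a prompt when its score reaches the (assumed) threshold."""
    return safety_score(layer_states, reference_direction, layers) >= threshold


# Toy data: three-dimensional "hidden states" for two layers per prompt.
reference = [1.0, 0.0, 0.0]
benign_states = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
unsafe_states = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0]]

print(is_flagged(benign_states, reference))
print(is_flagged(unsafe_states, reference))
```

Because a scorer of this kind only reads activations already produced during the forward pass, it needs no gradient updates, which is the sense in which such a detector can be tuning-free. In practice the threshold and the set of layers would have to be chosen on held-out data.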

Code & Models

Repositories

leigest519/hiddendetect (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics