Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni, Zhaoyuan Yang, Weikang Li, Xuesong Li, Jing Zhang

TL;DR
This paper introduces UJEM-KL, a lightweight untargeted attack that maximizes entropy at decision tokens to improve the transferability of jailbreaks on vision-language models (VLMs), challenging earlier findings that gradient-based image jailbreaks show little or no cross-model transfer.
Contribution
The paper proposes a novel entropy maximization attack method that enhances transferability of untargeted jailbreaks on VLMs, with comprehensive evaluation across models and benchmarks.
Findings
UJEM-KL achieves high success rates in white-box attacks.
The method improves transferability across different models.
Limited transferability stems mainly from overly constrained optimization objectives, such as enforcing a fixed prefix or response pattern.
Abstract
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks,…
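The objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `ujem_kl_style_loss`, the fixed `entropy_threshold` used to pick decision tokens, the `kl_weight` factor, and the direction of the KL term are all assumptions made for illustration. The sketch only shows the general shape of such a loss: reward entropy at positions where the clean model is already uncertain, and penalize divergence from the clean distribution everywhere else.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each position's next-token distribution.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def ujem_kl_style_loss(clean_logits, adv_logits,
                       entropy_threshold=1.0, kl_weight=1.0, eps=1e-12):
    """Hypothetical loss to minimize; inputs have shape (positions, vocab).

    Assumption: decision tokens are the positions whose clean entropy
    exceeds a fixed threshold. The paper may select them differently.
    """
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)

    # High-entropy positions under the clean input act as decision tokens.
    decision = token_entropy(p_clean) > entropy_threshold

    # Entropy of the attacked distribution at every position.
    adv_entropy = token_entropy(p_adv)

    # KL(clean || adv) per position, used to keep the other tokens stable.
    kl = (p_clean * (np.log(p_clean + eps) - np.log(p_adv + eps))).sum(axis=-1)

    entropy_term = adv_entropy[decision].mean() if decision.any() else 0.0
    kl_term = kl[~decision].mean() if (~decision).any() else 0.0

    # Minimizing this value maximizes entropy at decision tokens
    # while penalizing drift at the remaining low-entropy positions.
    return -entropy_term + kl_weight * kl_term, decision

# Toy example: position 0 is uncertain, positions 1 and 2 are confident.
clean = np.array([[0.0, 0.0, 0.0, 0.0],
                  [10.0, 0.0, 0.0, 0.0],
                  [0.0, 10.0, 0.0, 0.0]])
loss, decision = ujem_kl_style_loss(clean, clean)
```

In an actual attack the adversarial logits would come from a VLM forward pass on a perturbed image, and the loss would be minimized with respect to the image by gradient descent in an autodiff framework; the NumPy version here only evaluates the objective.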
