SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism

Beitao Chen; Xinyu Lyu; Lianli Gao; Jingkuan Song; Heng Tao Shen

arXiv:2507.01513 · cs.CR · December 4, 2025

TL;DR

SafePTR is a training-free method that enhances multimodal large language models' safety by selectively pruning harmful tokens at vulnerable layers, effectively mitigating jailbreak risks while maintaining efficiency.

Contribution

This paper introduces SafePTR, a novel prune-then-restore framework that precisely removes harmful multimodal tokens without additional training, improving safety against jailbreaks in MLLMs.

Findings

01

SafePTR significantly reduces jailbreak success rates across multiple models and benchmarks.

02

It preserves model utility and efficiency without additional training overhead.

03

Less than 1% of tokens in early-middle layers cause unsafe behaviors, enabling targeted pruning.
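
The idea of targeted pruning in the findings above can be illustrated with a minimal sketch. It assumes each token already carries a per-token harmfulness score; how SafePTR actually computes such scores and which layers it prunes at are not described on this page, so `scores`, `fraction`, and `prune_top_fraction` are hypothetical names used only for illustration. The sketch covers the pruning step alone and says nothing about the restore step.

```python
import math


def prune_top_fraction(tokens, scores, fraction=0.01):
    """Drop the highest-scoring tokens and keep the rest in original order.

    `scores` is a hypothetical per-token harmfulness score (an assumption,
    not the paper's scoring rule). Returns (kept_tokens, pruned_indices).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    # Number of tokens to remove, rounded down so short inputs lose nothing.
    k = math.floor(len(tokens) * fraction)
    if k == 0:
        return list(tokens), []

    # Rank token positions from most to least harmful by score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    pruned = set(ranked[:k])

    # Preserve the original ordering of the surviving tokens.
    kept = [tok for i, tok in enumerate(tokens) if i not in pruned]
    return kept, sorted(pruned)


if __name__ == "__main__":
    tokens = list(range(200))   # stand-in token ids
    scores = [0.0] * 200
    scores[17] = 0.9            # two tokens flagged by the hypothetical scorer
    scores[120] = 0.8
    kept, pruned = prune_top_fraction(tokens, scores, fraction=0.01)
    print(len(kept), pruned)
```

With a 1% budget, a 200-token input loses at most two tokens, which is why pruning at this granularity can leave most of the input untouched.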

Abstract

By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet they fall short in uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training overhead. To bridge this gap, we present a comprehensive analysis of where, how and which harmful multimodal tokens bypass safeguards in…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Advanced Neural Network Applications