Model-Editing-Based Jailbreak against Safety-aligned Large Language Models
Yuxi Li, Zhibo Zhang, Kailong Wang, Ling Shi, Haoyu Wang

TL;DR
This paper introduces Targeted Model Editing (TME), a white-box method that minimally alters LLMs to bypass safety filters without input modifications, achieving high attack success rates and exposing new security vulnerabilities.
Contribution
The paper proposes TME, a novel model-editing-based jailbreak that directly modifies internal model structures, surpassing existing input-based methods in stealth and effectiveness.
Findings
Achieves an average Attack Success Rate (ASR) of 84.86% on four open-source LLMs.
Eliminates the need for specific triggers or harmful response collections.
Demonstrates a covert, robust threat vector in LLM security.
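The headline figure above is an average of per-model attack success rates. A minimal sketch of how such a metric is commonly computed follows. It assumes ASR is the percentage of queries judged successful and that the average is an unweighted mean across models; the summary does not state the paper's exact judging or averaging procedure. The function names and the demo data are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Return the percentage of attempts judged successful.

    `outcomes` is a list of booleans, one per evaluated query
    (True = the attempt was judged a success). How success is judged
    is outside this sketch and is an assumption left to the evaluator.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(1 for o in outcomes if o) / len(outcomes)


def average_asr(per_model_outcomes):
    """Unweighted (macro) mean of per-model ASR values, in percent.

    Assumption: each model counts equally, regardless of how many
    queries were evaluated on it.
    """
    return mean(attack_success_rate(o) for o in per_model_outcomes.values())


# Hypothetical placeholder data, NOT figures from the paper.
demo = {
    "model_a": [True, True, False, True],   # 3 of 4 judged successful
    "model_b": [True, False, False, True],  # 2 of 4 judged successful
}

if __name__ == "__main__":
    for name, outcomes in demo.items():
        print(name, attack_success_rate(outcomes))
    print("average", average_asr(demo))
```

A macro average weights each model equally. If the models were evaluated on different numbers of queries, pooling all outcomes before dividing would give a different number, so it matters which convention a reported average follows.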
Abstract
Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks. Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications. By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process. Implemented in the D-LLM…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Privacy-Preserving Technologies in Data
