Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Yuxi Li; Zhibo Zhang; Kailong Wang; Ling Shi; Haoyu Wang

arXiv:2412.08201 · cs.CR · December 12, 2024

Open Access

TL;DR

This paper introduces Targeted Model Editing (TME), a white-box method that minimally alters LLMs to bypass safety filters without input modifications, achieving high attack success rates and exposing new security vulnerabilities.

Contribution

The paper proposes TME, a novel approach for model-based jailbreaks that directly modifies internal model structures, surpassing existing input-based methods in stealth and effectiveness.

Findings

01. Achieves an average Attack Success Rate (ASR) of 84.86% on four open-source LLMs.

02. Eliminates the need for specific triggers or harmful response collections.

03. Demonstrates a covert, robust threat vector in LLM security.

Abstract

Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks. Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications. By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process. Implemented in the D-LLM…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Privacy-Preserving Technologies in Data