Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content

Hongyuan Shen; Min Zheng; Jincheng Wang; Yang Zhao

arXiv:2502.20952·cs.CR·March 3, 2025

TL;DR

This paper introduces a freeze training method targeting lower layers of large language models, significantly reducing training time and resources while maintaining effectiveness in generating harmful content, and offers new insights into model interpretability.

Contribution

The study presents a freeze training approach that fine-tunes only the lower layers of a large language model. The authors report that it needs less training time and memory than existing methods such as LoRA while achieving a higher jailbreak success rate.

Findings

01 Lower layers are more sensitive to harmful content.

02 Freeze training on lower layers reduces training time and memory usage.

03 The method outperforms LoRA in jailbreak success rate and harm score.
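
The layer selection behind freeze training can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter naming scheme (`model.layers.<index>.…`), the helper names `layer_index` and `trainable_mask`, and the cutoff of two layers are all assumptions made for the example. In a framework such as PyTorch, the resulting mask would typically be applied by setting each parameter's `requires_grad` attribute. Frozen parameters need no gradients or optimizer state, which is where the memory saving comes from.

```python
import re

# Hypothetical naming scheme "model.layers.<index>.<rest>", common in
# decoder-only checkpoints. This is an assumption, not taken from the paper.
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def layer_index(param_name):
    """Return the transformer block index found in a parameter name, or None."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def trainable_mask(param_names, num_trainable_lower_layers):
    """Map each parameter name to True (train) or False (freeze).

    Only parameters in blocks 0 .. num_trainable_lower_layers - 1 are trained.
    Embeddings, the output head, and all upper blocks stay frozen.
    """
    mask = {}
    for name in param_names:
        idx = layer_index(name)
        mask[name] = idx is not None and idx < num_trainable_lower_layers
    return mask


# Illustrative parameter names only; a real model has many more.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
    "model.layers.30.mlp.down_proj.weight",
    "lm_head.weight",
]
mask = trainable_mask(names, num_trainable_lower_layers=2)
trainable = [name for name, keep in mask.items() if keep]
```

Here `trainable` holds only the two parameters from blocks 0 and 1. How many lower layers to unfreeze is a tuning choice that this summary does not specify.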

Abstract

With the widespread application of Large Language Models (LLMs) across various domains, their security has drawn increasing attention from both academia and industry. This study samples and normalizes the parameters of an LLM to generate visualizations and heatmaps of the parameter distributions, revealing notable discrepancies among certain hidden layers. It then computes statistical metrics for each layer and combines them into a Comprehensive Sensitivity Score, which identifies the lower layers as particularly sensitive to the generation of harmful content. Based on this finding, we employ a Freeze training strategy, performing Supervised Fine-Tuning only on the lower layers. Experimental results demonstrate that this…
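
The abstract does not state which statistics are used or how the Comprehensive Sensitivity Score combines them, so the following is only a hedged sketch of the general pattern: sample each layer's parameters, compute a few per-layer statistics, normalize each statistic across layers, and average them. The choice of metrics (`mean_abs`, `std`, `max_abs`), the equal weighting, min-max normalization, and all function names are assumptions for illustration, and the toy values are invented rather than drawn from any model.

```python
import random
import statistics


def layer_metrics(values):
    """Compute simple summary statistics for one layer's sampled parameters."""
    magnitudes = [abs(v) for v in values]
    return {
        "mean_abs": statistics.fmean(magnitudes),
        "std": statistics.pstdev(values),
        "max_abs": max(magnitudes),
    }


def min_max(xs):
    """Rescale a list of numbers to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def composite_scores(layers, sample_size=1000, seed=0):
    """Return one score per layer: the mean of its normalized metrics.

    Equal weighting is an assumption; the paper's actual formula may differ.
    """
    rng = random.Random(seed)
    per_layer = []
    for values in layers:
        # Subsample large layers so the statistics stay cheap to compute.
        if len(values) > sample_size:
            values = rng.sample(values, sample_size)
        per_layer.append(layer_metrics(values))
    keys = list(per_layer[0])
    normalized = {k: min_max([m[k] for m in per_layer]) for k in keys}
    return [
        sum(normalized[k][i] for k in keys) / len(keys)
        for i in range(len(per_layer))
    ]


# Toy parameter values for three layers, invented for the example.
toy_layers = [
    [-2.0, 1.0, 2.0, -1.0],
    [-0.2, 0.1, 0.2, -0.1],
    [-0.4, 0.2, 0.4, -0.2],
]
scores = composite_scores(toy_layers)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy input the first layer has the widest spread, so it ranks first in `ranking`. That ordering is a property of the invented numbers, not evidence for the paper's finding.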

Taxonomy

Methods: Softmax · Attention Is All You Need · Shrink and Fine-Tune