Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

Weiwei Qi; Zefeng Wu; Tianhang Zheng; Zikang Zhang; Xiaojun Jia; Zhan Qin; Kui Ren

arXiv:2604.08297 · cs.CR · April 10, 2026

TL;DR

This paper introduces the Expected Safety Impact framework to identify safety-critical parameters in LLMs and proposes two intervention methods, SET and SPA, to improve safety without compromising performance.

Contribution

The paper presents the Expected Safety Impact (ESI) framework for quantifying how individual parameters affect LLM safety, and introduces two targeted intervention paradigms built on it: SET for safety enhancement and SPA for safety preservation.

Findings

01. SET reduces attack success rates by over 50% with minimal weight updates.

02. SPA limits safety degradation to within 1% during instruction fine-tuning.

03. ESI reveals distinct safety-critical parameter patterns across different LLM architectures.

Abstract

Ensuring Large Language Model (LLM) safety is crucial, yet the lack of a clear understanding about safety mechanisms hinders the development of precise and reliable methodologies for safety intervention across diverse tasks. To better understand and control LLM safety, we propose the Expected Safety Impact (ESI) framework for quantifying how different parameters affect LLM safety. Based on ESI, we reveal distinct safety-critical patterns across different LLM architectures: In dense LLMs, many safety-critical parameters are located in value matrices (V) and MLPs in middle layers, whereas in Mixture-of-Experts (MoE) models, they shift to the late-layer MLPs. Leveraging ESI, we further introduce two targeted intervention paradigms for safety enhancement and preservation, i.e., Safety Enhancement Tuning (SET) and Safety Preserving Adaptation (SPA). SET can align unsafe LLMs by updating only…
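The general idea of scoring parameters and then intervening only on the highest-scoring ones can be illustrated with a toy sketch. This is not the paper's method: the abstract does not give the ESI formula, so the score below is a generic first-order saliency proxy, |w · g|, chosen purely for illustration. The function names, the `fraction` parameter, and the example gradients are all hypothetical.

```python
# Illustrative sketch only. The score used here is a generic first-order
# saliency proxy (|w * g|), NOT the paper's ESI definition, which the
# abstract does not spell out. All names and values are hypothetical.

def saliency_scores(weights, grads):
    """Score each parameter by |w * g|, a first-order estimate of how much
    the loss would change if that parameter were set to zero."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def top_fraction_mask(scores, fraction):
    """Return a 0/1 mask selecting the highest-scoring `fraction` of
    parameters (always at least one)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]


def masked_sgd_step(weights, grads, mask, lr):
    """Apply one SGD step to the selected parameters only; the rest stay frozen."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


weights = [0.5, -1.0, 0.1, 2.0]
safety_grads = [0.2, 0.4, -3.0, 0.01]  # hypothetical gradients of a safety loss

scores = saliency_scores(weights, safety_grads)
mask = top_fraction_mask(scores, fraction=0.25)
updated = masked_sgd_step(weights, safety_grads, mask, lr=0.1)
```

Here only the single highest-scoring parameter is updated and the other three are left unchanged. Inverting the mask would instead freeze the selected parameters during an update. Whether SET or SPA works in either way is not stated in the text above.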

Code & Models

Repositories

ZJU-LLM-Safety/SafeWeights-ACL (GitHub)
