Can LLM Safety Be Ensured by Constraining Parameter Regions?
Zongmin Li, Jian Su, Farah Benamara, Aixin Sun

TL;DR
This paper systematically evaluates methods for identifying safety regions in LLMs across multiple models and datasets, and finds that the regions identified by current techniques are neither stable nor dataset-independent.
Contribution
The study provides a comprehensive comparison of safety region identification methods, highlighting their limitations in stability and dataset generalization across multiple LLM families.
Findings
Safety regions have low to moderate overlap across methods.
Refinement with utility datasets reduces safety region overlap.
Current techniques do not reliably identify dataset-agnostic safety regions.
Abstract
Large language models (LLMs) are often assumed to contain "safety regions" (parameter subsets whose modification directly influences safety behaviors). We conduct a systematic evaluation of four safety region identification methods spanning different parameter granularities, from individual weights to entire Transformer layers, across four families of backbone LLMs with varying sizes. Using ten safety identification datasets, we find that the identified safety regions exhibit only low to moderate overlap, as measured by IoU. The overlap drops significantly when the safety regions are further refined using utility datasets (i.e., non-harmful queries). These results suggest that current techniques fail to reliably identify a stable, dataset-agnostic safety region.
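
To make the overlap measure concrete, here is a minimal sketch of intersection over union (IoU) between two identified regions. It is an illustration under stated assumptions, not the paper's implementation: a region is represented as a set of hypothetical (layer name, weight index) pairs, the toy data is invented, and "refinement" is assumed to mean removing parameters that a utility dataset also flags, which the abstract does not specify.

```python
from itertools import combinations


def iou(region_a, region_b):
    """Intersection over union of two sets of parameter identifiers."""
    a, b = set(region_a), set(region_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty regions count as identical
    return len(a & b) / len(union)


def mean_pairwise_iou(regions):
    """Average IoU over all unordered pairs of regions."""
    pairs = list(combinations(regions, 2))
    if not pairs:
        raise ValueError("need at least two regions to compare")
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


def refine(safety_region, utility_region):
    """ASSUMPTION: refinement drops parameters also flagged by a utility dataset."""
    return set(safety_region) - set(utility_region)


# Invented toy regions: (layer name, weight index) pairs, purely illustrative.
region_ds1 = {("layer0", 3), ("layer0", 7), ("layer1", 2), ("layer2", 5)}
region_ds2 = {("layer0", 3), ("layer1", 2), ("layer1", 9), ("layer2", 8)}
utility_region = {("layer0", 3), ("layer1", 9)}

raw_overlap = iou(region_ds1, region_ds2)
refined_overlap = iou(refine(region_ds1, utility_region),
                      refine(region_ds2, utility_region))

print(f"IoU before refinement: {raw_overlap:.3f}")
print(f"IoU after refinement:  {refined_overlap:.3f}")
```

In this toy case the two regions share 2 of 6 distinct parameters before refinement and 1 of 4 afterwards, so the IoU falls from about 0.333 to 0.25. The numbers only illustrate how removing shared parameters can lower overlap; they say nothing about the magnitudes reported in the paper.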
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
