Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study

Kaustubh Ponkshe; Shaan Shah; Raghav Singhal; Praneeth Vepakomma

arXiv:2505.14185·cs.LG·February 10, 2026



TL;DR

This study empirically investigates whether safety-related behaviors in large language models are localized in specific linear subspaces and finds that safety is highly entangled with general model functions, challenging subspace-based safety defenses.

Contribution

The paper provides a comprehensive empirical analysis showing safety behaviors are not confined to distinct subspaces but are entangled with general model representations, highlighting limitations of subspace-based safety methods.

Findings

01

Safety and useful behaviors activate overlapping subspaces.

02

Safety behaviors are entangled with general-purpose learning components.

03

Subspace-based defenses face fundamental limitations due to entanglement.

Abstract

Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify…
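The subspace perspective described above can be made concrete with a small numerical sketch. It is an illustration under stated assumptions, not the authors' pipeline: the weight deltas are random stand-ins, the function names `top_k_subspace` and `subspace_overlap` are hypothetical, and the overlap score (mean squared cosine of the principal angles) is one common choice that may differ from the measures used in the paper.

```python
import numpy as np

def top_k_subspace(delta_w, k):
    """Return an orthonormal basis (as columns) for the top-k left
    singular directions of a weight delta."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]

def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two subspaces.
    1.0 means identical subspaces, 0.0 means orthogonal subspaces."""
    # The singular values of A^T B are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
d, k = 64, 8

# Hypothetical stand-ins for (aligned - base) and (fine-tuned - base) weights.
safety_delta = rng.standard_normal((d, d))
useful_delta = rng.standard_normal((d, d))

safety_basis = top_k_subspace(safety_delta, k)
useful_basis = top_k_subspace(useful_delta, k)

self_overlap = subspace_overlap(safety_basis, safety_basis)
cross_overlap = subspace_overlap(safety_basis, useful_basis)
print(f"self overlap:  {self_overlap:.3f}")
print(f"cross overlap: {cross_overlap:.3f}")
```

For two unrelated random subspaces of dimension k in d dimensions, this score is expected to sit near k/d, which gives a baseline for judging whether an observed overlap is larger than chance.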

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper presents an extensive empirical analysis that extends from weight-update directions to activations and evaluates multiple open-source LLM families (Llama, Qwen) and fine-tuning regimes (helpful, harmful, contaminated).
- The work draws a concrete practical implication: subspace-based defenses cannot selectively suppress harmful behavior without proportional utility loss, guiding future safety strategies away from linear subspace filtering.
- The results provide consistent evidence in…

Weaknesses

- This work focuses on defense mechanisms based on linear orthogonal projections, leaving non-linear alternatives unexplored.
- The paper reports activation-space results as averages over a mid-depth slice (about 65–90% of layers) rather than layer by layer, which may not fully capture layer-specific behavior.
- The paper lacks a theoretical investigation of the results, relying on empirical evidence without a formal framework explaining why safety and utility directions overlap. This…
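As a rough illustration of the linear orthogonal projection this review refers to, the sketch below removes from a weight update its component inside a given subspace. It is a minimal example under assumptions: the basis and update are random stand-ins, and `project_out` is a hypothetical helper, not a function from the paper's repository.

```python
import numpy as np

def project_out(update, basis):
    """Remove from `update` its component inside span(basis).
    The columns of `basis` are assumed to be orthonormal."""
    return update - basis @ (basis.T @ update)

rng = np.random.default_rng(0)
d, k = 32, 4

# Hypothetical orthonormal basis for a subspace to be filtered out.
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Hypothetical fine-tuning update to one weight matrix.
update = rng.standard_normal((d, d))

filtered = project_out(update, basis)

# Nothing of the filtered update remains inside the subspace.
residual = np.linalg.norm(basis.T @ filtered)

# Fraction of the update's squared Frobenius norm that the projection removed.
removed_fraction = 1.0 - np.linalg.norm(filtered) ** 2 / np.linalg.norm(update) ** 2
print(f"residual in subspace: {residual:.2e}")
print(f"fraction removed:     {removed_fraction:.3f}")
```

The projection is indifferent to what the removed directions encode: if they carry both harmful and useful components, both are discarded together, which is the tradeoff the reviews describe.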

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper formalizes “safety subspaces” concretely by building them from principal components of weight differences. Although PCA on weight deltas is common in task-vector or model-diffing work, applying it here provides an interesting way to identify safety regions beyond the activation-only analyses popular in recent AI safety work.
2. It introduces two sound measures of subspace overlap that are intuitively meaningful, providing a clear alternative to relying solely on downstr…

Weaknesses

1. The conclusions may depend on the specific design choices for subspace dimension and PCA aggregation (it seems that SafeMERGE supports a form of layer-wise separability). The paper fixes the number of principal components and the layer aggregation strategy without sensitivity analysis, so the observed overlap might partially reflect those implementation choices rather than a conclusive property.
2. The experiments use a single alignment method and do not test how geometric overlap changes un…

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. **Significance:** The paper addresses the broadly believed hypothesis that concept-related features are essentially contained in a specific lower-dimensional subspace.
2. **Novelty:** While mostly building upon [1], the authors take a critical standpoint to better investigate the source of the positive results in [1] and the tradeoffs it poses (utility drops similarly with the removal of the "harmful subspace"). There have been works observing that most jailbreak defenses are also likely to hurt the g…

Weaknesses

Although the paper claims its findings hold for both weight space and activation space, the activation space seems relatively underexplored. Considering that LLM steering is predominantly done on activations (Activation Addition, Directional Ablation, Manifold Steering, Angular Steering, etc.), I think the paper does not deliver sufficiently on this important aspect. The following questions specify why exactly I think the current study is insufficient.

Code & Models

Repositories

cert-lab/safety-subspaces
PyTorch · Official


Taxonomy

Topics: Risk and Safety Analysis · Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research