Enhancing Safety of Large Language Models via Embedding Space Separation

Xu Zhao; Xiting Wang; Weiran Shen

arXiv:2603.20206·cs.CL·March 24, 2026

Open Access

TL;DR

This paper introduces Embedding Space Separation (ES2), a fine-tuning method that enhances large language model safety by increasing the distance between harmful and safe query representations in the embedding space, while preserving overall performance.

Contribution

The paper proposes a novel embedding space separation technique with KL regularization for improving LLM safety without sacrificing capabilities.

Findings

01. Significant safety improvements on open-source LLMs

02. Maintains comparable general capabilities

03. Effective on standard safety benchmarks

Abstract

Large language models (LLMs) have achieved impressive capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent work has revealed that the latent representations (embeddings) of harmful and safe queries in LLMs are typically linearly separable, a property that has been exploited to construct attacks that perturb the embeddings of harmful queries towards the safe subspace. Motivated by this observation, we propose a representation-level fine-tuning approach, named Embedding Space Separation (ES2), which improves LLM safety by explicitly enlarging the distance between harmful and safe representations in the embedding space. To prevent degradation of the model's general capabilities, we introduce a Kullback-Leibler (KL) divergence regularization term into the loss function, which constrains the logits of the fine-tuned model to align with those…
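
The two ingredients named in the abstract, a term that enlarges the distance between harmful and safe representations and a KL term on the logits, can be illustrated with a toy objective. This is only a sketch under stated assumptions, not the paper's loss: the abstract does not say which distance is used or how the terms are weighted, and it is truncated before naming the reference for the KL term. Here the distance is assumed to be Euclidean distance between class centroids, the reference logits are assumed to come from a frozen copy of the model before fine-tuning, and `separation_loss` and `kl_weight` are hypothetical names.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def kl_divergence(p_logits, q_logits):
    """KL(p || q) for two distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def separation_loss(harmful_emb, safe_emb, ref_logits, tuned_logits, kl_weight=1.0):
    """Toy objective: reward centroid separation, penalize logit drift.

    Assumptions (not taken from the paper): centroid-based Euclidean
    distance, a frozen reference model for ref_logits, and a simple
    weighted sum of the two terms.
    """
    distance = euclidean(centroid(harmful_emb), centroid(safe_emb))
    drift = kl_divergence(ref_logits, tuned_logits)
    # Minimizing this value pushes the centroids apart while keeping the
    # fine-tuned logits close to the reference logits.
    return -distance + kl_weight * drift
```

Under these assumptions, minimizing the value moves the two groups of embeddings apart, and a larger `kl_weight` trades separation for staying closer to the reference outputs.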

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques