The Blessing and Curse of Dimensionality in Safety Alignment
Rachel S.Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen

TL;DR
This paper explores how high-dimensional representations in large language models can both aid and hinder safety alignment, revealing that dimensionality can be exploited to bypass safety measures, but also that reducing dimensions can improve safety.
Contribution
The study provides visualizations and empirical evidence showing how high-dimensional representation spaces affect safety alignment, and introduces dimensionality reduction as a method to mitigate linear exploits.
Findings
Dimensionality influences safety alignment and vulnerability to jailbreaking.
Reducing dimensions preserves alignment information while reducing exploitability.
Linear structures in high-dimensional spaces can be exploited to bypass safety measures.
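To make the last finding concrete, the following is a minimal sketch of one common activation-engineering recipe: estimate a concept direction as a difference of class means, then project it out of the activations. It is an illustration under stated assumptions, not the paper's method. The data is synthetic Gaussian noise rather than real LLM hidden states, and the names `make_synthetic_activations`, `difference_of_means_direction`, and `ablate_direction` are hypothetical helpers introduced here.

```python
import numpy as np


def make_synthetic_activations(n, dim, concept, offset, rng):
    """Hypothetical stand-in for hidden states: unit Gaussian noise
    shifted by `offset` along the unit vector `concept`."""
    return rng.normal(size=(n, dim)) + offset * concept


def difference_of_means_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_b to the mean of acts_a."""
    direction = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return direction / np.linalg.norm(direction)


def ablate_direction(acts, direction):
    """Remove the component of every activation along a unit direction."""
    return acts - np.outer(acts @ direction, direction)


rng = np.random.default_rng(0)
dim = 256  # assumed hidden size, chosen only for illustration

# Assumption: the "safety" concept lies along a single axis of the space.
concept = np.zeros(dim)
concept[0] = 1.0

harmful = make_synthetic_activations(200, dim, concept, +4.0, rng)
harmless = make_synthetic_activations(200, dim, concept, -4.0, rng)

# Estimate the concept direction from the two groups of activations.
direction = difference_of_means_direction(harmful, harmless)
cosine = float(direction @ concept)

# Separation of the two groups along the estimated direction.
gap_before = float((harmful @ direction).mean() - (harmless @ direction).mean())

# After ablation, neither group has any component left along that direction.
gap_after = float(
    (ablate_direction(harmful, direction) @ direction).mean()
    - (ablate_direction(harmless, direction) @ direction).mean()
)

print(f"cosine with true concept axis: {cosine:.3f}")
print(f"gap before ablation: {gap_before:.3f}")
print(f"gap after ablation:  {gap_after:.2e}")
```

In this toy setting a single linear operation erases the separation between the two groups along the estimated direction, which is the kind of linear structure the finding describes as exploitable.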
Abstract
The focus on safety alignment in large language models (LLMs) has increased significantly due to their widespread adoption across different domains. The scale of LLMs plays a contributing role in their success, and the growth in parameter count is accompanied by larger hidden dimensions. In this paper, we hypothesize that while the increase in dimensions has been a key advantage, it may lead to emergent problems as well. These problems emerge as the linear structures in the activation space can be exploited, in the form of activation engineering, to circumvent the model's safety alignment. Through detailed visualizations of linear subspaces associated with different concepts, such as safety, across various model scales, we show that the curse of high-dimensional representations uniquely impacts LLMs. Further substantiating our claim, we demonstrate that projecting the representations of the model onto a…
