Large Language Models Encode Semantics and Alignment in Linearly Separable Representations

Baturay Saglam; Paul Kassianik; Blaine Nelson; Sajana Weerawardhena; Yaron Singer; Amin Karbasi

arXiv:2507.09709 · cs.CL · January 22, 2026

Open Access

TL;DR

This paper demonstrates that large language models encode semantic information in linearly separable, low-dimensional subspaces, which can be exploited to improve safety and alignment through geometry-aware tools.

Contribution

It provides the first large-scale empirical evidence that high-level semantics in LLMs are linearly separable, enabling new methods for alignment and safety interventions.

Findings

01 Semantic information resides in low-dimensional, linearly separable subspaces.

02 Separable representations become more distinct in deeper layers.

03 A simple MLP probe improves safety by detecting harmful content.
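To make the idea of a probe concrete, the following is a minimal sketch of a one-hidden-layer MLP classifier trained on activation vectors. It is not the authors' implementation: the data are synthetic Gaussian clusters standing in for hidden states of "benign" and "harmful" prompts, and every name and hyperparameter (`make_synthetic_activations`, `train_mlp_probe`, `hidden=16`, `lr=0.2`, and so on) is an assumption chosen for illustration.

```python
import numpy as np

def make_synthetic_activations(n_per_class=200, dim=32, shift=3.0, seed=0):
    """Synthetic stand-in for hidden states (assumption, not real LLM data).

    Class 0 ("benign") and class 1 ("harmful") are unit-variance Gaussian
    clouds shifted in opposite directions along one random unit vector.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    y = np.repeat([0.0, 1.0], n_per_class)
    X = rng.normal(size=(2 * n_per_class, dim))
    X += shift * (2.0 * y - 1.0)[:, None] * direction
    order = rng.permutation(2 * n_per_class)
    return X[order], y[order]

def train_mlp_probe(X, y, hidden=16, lr=0.2, epochs=500, seed=0):
    """Full-batch gradient descent on the logistic loss of a tanh MLP."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=1.0 / np.sqrt(hidden), size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                      # hidden activations
        p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))      # predicted P(class 1)
        g = (p - y) / n                               # dLoss/dlogit, averaged
        gH = np.outer(g, W2) * (1.0 - H ** 2)         # backprop through tanh
        W2 -= lr * (H.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gH)
        b1 -= lr * gH.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    """Return hard 0/1 predictions from the trained probe."""
    W1, b1, W2, b2 = params
    logits = np.tanh(X @ W1 + b1) @ W2 + b2
    return (logits > 0).astype(float)

X, y = make_synthetic_activations()
params = train_mlp_probe(X[:300], y[:300])            # train on 300 samples
accuracy = float((predict(params, X[300:]) == y[300:]).mean())
print(f"held-out accuracy: {accuracy:.3f}")
```

In a real setting, `X` would be replaced by hidden states extracted from a chosen layer of the model and `y` by labels from a safety dataset; neither is specified on this page, so both are left as placeholders here.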

Abstract

Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. Yet it remains unclear to what extent LLMs linearly organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across six scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior, even when surface content remains unchanged. These findings motivate geometry-aware tools that operate directly in latent space to detect and mitigate harmful and adversarial content. As a proof of concept,…
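As a rough illustration of what "linearly separable in a low-dimensional subspace" means operationally, the sketch below projects vectors onto their top principal components and then fits a linear classifier in that reduced space. It is an assumption-laden toy, not the paper's pipeline: the "hidden states" are synthetic, the helper names (`make_topic_states`, `pca_project`, `fit_linear_probe`) are hypothetical, and a least-squares classifier is swapped in as the simplest linear probe.

```python
import numpy as np

def make_topic_states(n_per_class=200, dim=64, shift=3.0, seed=0):
    """Synthetic stand-in for hidden states of two topics (assumption).

    Each topic is a unit-variance Gaussian cloud; the two clouds are
    shifted in opposite directions along one random unit vector.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, dim))
    X += shift * (2.0 * y - 1.0)[:, None] * direction
    order = rng.permutation(2 * n_per_class)
    return X[order], y[order]

def pca_project(X, k):
    """Project centered data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

def fit_linear_probe(Z, y):
    """Least-squares linear classifier with a bias term (targets are -1/+1)."""
    A = np.hstack([Z, np.ones((len(Z), 1))])
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return w

def probe_accuracy(w, Z, y):
    """Fraction of points on the correct side of the hyperplane."""
    A = np.hstack([Z, np.ones((len(Z), 1))])
    return float(((A @ w > 0).astype(int) == y).mean())

X, y = make_topic_states()
Z = pca_project(X, k=2)                     # 64 dimensions reduced to 2
w = fit_linear_probe(Z[:300], y[:300])      # fit on 300 samples
accuracy = probe_accuracy(w, Z[300:], y[300:])
print(f"held-out accuracy in 2-D subspace: {accuracy:.3f}")
```

High held-out accuracy after projecting to a handful of dimensions is the kind of evidence the abstract describes; repeating the same measurement per layer would be one way to check whether separability grows with depth.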

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)