The Cylindrical Representation Hypothesis for Language Model Steering

Lang Gao; Jinghui Zhang; Wei Liu; Fengxian Ji; Chenxi Wang; Zirui Song; Akash Ghosh; Youssef Mohamed; Preslav Nakov; Xiuying Chen

arXiv:2605.01844 · cs.CL · May 5, 2026

TL;DR

The paper proposes the Cylindrical Representation Hypothesis (CRH) to explain the instability and unpredictability of language model steering, emphasizing a cylindrical structure in concept representations.

Contribution

The paper introduces CRH, a framework that accounts for overlapping concepts and intrinsic uncertainty and thereby explains why steering behaves unpredictably.

Findings

1. CRH reveals a cylindrical structure in concept representations.
2. Steering sensitivity is controlled by a normal plane around the main axis.
3. Uncertainty at the sector level explains fluctuations in steering outcomes.

Abstract

Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can…
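The axis and normal-plane picture described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes a difference-of-means direction as the central axis (a common steering heuristic, not confirmed by the text above), uses synthetic Gaussian activations, and the names `concept_axis`, `decompose`, and `steer` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def concept_axis(absent, present):
    """Unit vector from the mean 'concept absent' activation to the mean
    'concept present' activation (difference of means; an assumption)."""
    diff = present.mean(axis=0) - absent.mean(axis=0)
    return diff / np.linalg.norm(diff)


def decompose(h, axis, origin):
    """Split h - origin into a scalar coordinate along the axis and a
    residual vector lying in the plane orthogonal to the axis."""
    centered = h - origin
    along = float(centered @ axis)
    normal = centered - along * axis
    return along, normal


def steer(h, axis, alpha):
    """Additive steering: shift the activation by alpha along the axis."""
    return h + alpha * axis


# Synthetic activations (illustrative only, not data from the paper).
rng = np.random.default_rng(0)
d = 16
shift = np.zeros(d)
shift[0] = 3.0
absent = rng.normal(size=(200, d))
present = rng.normal(size=(200, d)) + shift

axis = concept_axis(absent, present)
origin = absent.mean(axis=0)

h = absent[0]
along, normal = decompose(h, axis, origin)

steered = steer(h, axis, alpha=2.0)
along_s, normal_s = decompose(steered, axis, origin)
```

In this toy setting, steering changes only the axial coordinate and leaves the normal-plane component untouched. How the normal-plane component governs steering sensitivity is the paper's claim and is not modeled here.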

Code & Models

Repositories

mbzuai-nlp/CRH
github

Datasets

LangGao/CRH_Data
dataset · 48 downloads