The Cylindrical Representation Hypothesis for Language Model Steering
Lang Gao, Jinghui Zhang, Wei Liu, Fengxian Ji, Chenxi Wang, Zirui Song, Akash Ghosh, Youssef Mohamed, Preslav Nakov, Xiuying Chen

TL;DR
The paper proposes the Cylindrical Representation Hypothesis (CRH) to explain the instability and unpredictability of language model steering, emphasizing a cylindrical structure in concept representations.
Contribution
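The paper introduces CRH, a framework that relaxes the orthogonality assumption of the Linear Representation Hypothesis (LRH) while keeping linear representations. It accounts for overlapping concepts and intrinsic uncertainty, which helps explain why steering outcomes are unstable.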
Findings
CRH posits a cylindrical structure in concept representations: a central axis with a surrounding normal plane.
Steering sensitivity is controlled by a normal plane around the main axis.
Uncertainty at the sector level explains fluctuations in steering outcomes.
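The split between a central axis and a surrounding normal plane can be illustrated with a small numerical sketch. This is not the paper's method: the difference-of-means axis, the toy data, and the function names `concept_axis` and `split_axis_normal` are all assumptions made for illustration. It only shows how a representation vector decomposes into a component along an axis and a residual in the plane orthogonal to it.

```python
import numpy as np

def concept_axis(absent, present):
    """Unit vector along the difference of class means.

    Assumption: difference of means is a common steering heuristic,
    used here only as a stand-in for the central axis.
    """
    diff = present.mean(axis=0) - absent.mean(axis=0)
    return diff / np.linalg.norm(diff)

def split_axis_normal(h, axis):
    """Split h into its part along `axis` and the residual orthogonal to it."""
    axial = np.dot(h, axis) * axis
    normal = h - axial
    return axial, normal

# Hypothetical toy data: "present" samples are shifted along the first coordinate.
rng = np.random.default_rng(0)
d = 8
absent = rng.normal(size=(50, d))
present = absent + 2.0 * np.eye(d)[0] + 0.1 * rng.normal(size=(50, d))

axis = concept_axis(absent, present)
h = present[0]
axial, normal = split_axis_normal(h, axis)
```

In this sketch, `axial` plays the role of the component that separates concept absence from presence, and `normal` is the sample-specific remainder lying in the plane orthogonal to the axis.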
Abstract
Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can…
