Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Daniel Wurgaft, Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas McGrath, Owen Lewis, Jack Merullo, Noah Goodman, Thomas Fel, Atticus Geiger, Ekdeep Singh Lubana

TL;DR
This paper demonstrates that understanding and manipulating the geometric structure of neural representations enables more natural and effective control of model behavior across various tasks and modalities.
Contribution
It introduces manifold steering, a method that respects activation space geometry, revealing a bidirectional link between neural representation geometry and behavior.
Findings
Steering along the activation manifold yields natural behavioral trajectories.
Linear steering in Euclidean space produces off-manifold, unnatural outputs.
Optimizing interventions along behavior manifolds traces the curvature of activation manifolds.
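The contrast between the first two findings can be illustrated with a toy sketch. This is not the paper's method: the unit circle here stands in for a fitted activation manifold, and the function names (`linear_path`, `circle_path`, `off_manifold_distance`) are hypothetical helpers introduced only for illustration. It shows that a straight-line path between two on-manifold points leaves the manifold, while a path that follows the manifold's geometry does not.

```python
import numpy as np

def linear_path(a, b, n_steps=11):
    """Straight-line (Euclidean) interpolation between two points."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * a + t * b

def circle_path(a, b, n_steps=11):
    """Interpolate along the unit circle (the toy manifold) by interpolating the angle."""
    theta_a = np.arctan2(a[1], a[0])
    theta_b = np.arctan2(b[1], b[0])
    theta = np.linspace(theta_a, theta_b, n_steps)
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

def off_manifold_distance(points):
    """Distance of each point from the unit circle."""
    return np.abs(np.linalg.norm(points, axis=1) - 1.0)

# Two points that both lie on the toy manifold (assumed values).
start = np.array([1.0, 0.0])
end = np.array([-1.0, 0.0])

linear_dev = off_manifold_distance(linear_path(start, end)).max()
manifold_dev = off_manifold_distance(circle_path(start, end)).max()

print(f"max off-manifold distance, linear path:   {linear_dev:.3f}")
print(f"max off-manifold distance, manifold path: {manifold_dev:.3f}")
```

In this toy setting the linear path passes through the circle's center, the farthest it can get from the manifold, whereas the angle-interpolated path stays on the circle at every step. In a real model the manifold would have to be fitted to activations rather than assumed in closed form.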
Abstract
Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold to representations and a behavior manifold to output probability distributions. We then test the link via interventions: we find that steering along the activation manifold, which we term manifold steering, yields behavioral trajectories that follow the behavior manifold, while linear steering, which assumes a Euclidean geometry, cuts through off-manifold regions and hence produces unnatural outputs. Moreover,…
