Finding Belief Geometries with Sparse Autoencoders
Matthew Levinson

TL;DR
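This paper introduces a pipeline combining sparse autoencoders (SAEs), subspace clustering of SAE features, and simplex fitting to identify belief-like geometric structures in transformer representations. The pipeline is validated on a transformer trained on a hidden Markov model with known belief-state geometry, then applied to Gemma-2-9B.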
Contribution
It presents a novel method for discovering and validating belief-state geometries in large language models' internal representations.
Findings
Identified 13 candidate clusters with simplex geometry in Gemma-2-9B.
Three clusters showed significant belief-state encoding on vertex samples.
One cluster achieved the highest causal steering score, indicating belief-like geometry.
Abstract
Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), k-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters…
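To make the notion of a belief state concrete, the sketch below runs standard Bayesian filtering on a small hidden Markov model. Each belief is a probability vector over hidden states, so every belief lies on a simplex whose vertices correspond to certainty about a single state. The 3-state HMM, its matrices, and the function names are hypothetical illustrations, not the multipartite model or the code used in the paper.

```python
import numpy as np

def belief_update(belief, transition, emission, obs):
    """One Bayesian filtering step for an HMM.

    belief:     (S,) current distribution over hidden states
    transition: (S, S) row-stochastic, transition[i, j] = P(next=j | current=i)
    emission:   (S, V), emission[j, o] = P(obs=o | state=j)
    obs:        index of the observed token
    """
    predicted = belief @ transition          # propagate through the dynamics
    unnorm = predicted * emission[:, obs]    # weight by observation likelihood
    return unnorm / unnorm.sum()             # renormalize onto the simplex

def belief_trajectory(prior, transition, emission, observations):
    """Return the sequence of beliefs, starting from the prior."""
    beliefs = [np.asarray(prior, dtype=float)]
    for obs in observations:
        beliefs.append(belief_update(beliefs[-1], transition, emission, obs))
    return np.stack(beliefs)

# Hypothetical toy HMM with 3 hidden states and 3 observable tokens.
transition = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])
emission = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.7, 0.2],
                     [0.2, 0.1, 0.7]])
prior = np.full(3, 1.0 / 3.0)

traj = belief_trajectory(prior, transition, emission, [0, 0, 1, 2, 2])
print(traj.round(3))  # each row is a point on the 2-simplex
```

Every row of `traj` is non-negative and sums to 1, which is the simplex structure that the prior work cited in the abstract reports finding in the residual stream of transformers trained on HMM-generated sequences.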
