Atlas-Alignment: Making Interpretability Transferable Across Language Models
Bruno Puri, Jim Berend, Sebastian Lapuschkin, Wojciech Samek

TL;DR
Atlas-Alignment offers a scalable approach to interpretability by aligning new language models' latent spaces with a pre-existing Concept Atlas, enabling cost-effective transparency and control without extensive retraining.
Contribution
The paper introduces a lightweight alignment framework that transfers interpretability across models using only shared inputs and a pre-existing, labeled Concept Atlas, reducing the "transparency tax" of interpreting each new model.
Findings
Enables semantic retrieval without labeled concept datasets
Allows steerable generation across models
Avoids training and labeling model-specific components (e.g., sparse autoencoders) for each new model
Abstract
Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires training model-specific components (e.g., sparse autoencoders), followed by manual or semi-automated labeling and validation, imposing a growing "transparency tax" that does not scale with the pace of model development. We introduce Atlas-Alignment, a framework that avoids this cost by aligning the latent space of a new model to a pre-existing, labeled Concept Atlas using only shared inputs and lightweight representational alignment methods. Through quantitative and qualitative evaluations, we show that simple alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept datasets. Atlas-Alignment thus amortizes the cost of…
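To make the idea of aligning a new model to a labeled atlas through shared inputs concrete, here is a minimal sketch. It is not the authors' method: the excerpt above does not say which alignment methods Atlas-Alignment uses, so this example assumes one of the simplest possible choices, matching each unit of the new model to the atlas concept whose activations correlate most strongly with it over the same inputs. The function names (`pearson`, `align_to_atlas`), the concept labels, and the toy activations are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def align_to_atlas(atlas_acts, new_acts, atlas_labels):
    """Assign each unit of the new model the best-correlated atlas concept.

    atlas_acts: one row per shared input, one column per labeled atlas concept.
    new_acts:   one row per shared input, one column per unit of the new model.
    Returns {new_unit_index: (atlas_label, correlation)}.
    Assumption: correlation matching stands in for the alignment step.
    """
    atlas_cols = list(zip(*atlas_acts))
    new_cols = list(zip(*new_acts))
    mapping = {}
    for j, col in enumerate(new_cols):
        scores = [pearson(col, a) for a in atlas_cols]
        best = max(range(len(scores)), key=lambda i: scores[i])
        mapping[j] = (atlas_labels[best], scores[best])
    return mapping


# Toy data (hypothetical): the "new model" is a permuted, rescaled, noisy
# copy of the atlas activations on the same 200 shared inputs.
rng = random.Random(0)
labels = ["sports", "finance", "weather"]
atlas_acts = [[rng.gauss(0, 1) for _ in labels] for _ in range(200)]
perm = [2, 0, 1]  # new unit j mirrors atlas concept perm[j]
new_acts = [[2.0 * row[p] + rng.gauss(0, 0.1) for p in perm] for row in atlas_acts]

mapping = align_to_atlas(atlas_acts, new_acts, labels)
for unit, (label, corr) in mapping.items():
    print(f"new unit {unit} -> {label} (r = {corr:.3f})")
```

The point of the sketch is that no labeled concept dataset is needed for the new model: the labels come entirely from the atlas, and the only requirement is that both models see the same inputs. A real setting would involve far higher-dimensional activations and likely a learned mapping rather than one-to-one unit matching.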
