Atlas-Alignment: Making Interpretability Transferable Across Language Models

Bruno Puri; Jim Berend; Sebastian Lapuschkin; Wojciech Samek

arXiv:2510.27413 · cs.LG · April 27, 2026

TL;DR

Atlas-Alignment makes interpretability scalable by aligning a new language model's latent space to a pre-existing, labeled Concept Atlas, giving transparency and control without training and labeling model-specific interpretability components for each new model.

Contribution

The paper introduces a lightweight alignment framework that transfers interpretability across models using only shared inputs and a pre-existing, labeled Concept Atlas, reducing the "transparency tax" of interpreting each new model from scratch.

Findings

01 Enables semantic retrieval without labeled concept datasets

02 Allows steerable generation across models

03 Reduces interpretability costs significantly

Abstract

Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires training model-specific components (e.g., sparse autoencoders), followed by manual or semi-automated labeling and validation, imposing a growing "transparency tax" that does not scale with the pace of model development. We introduce Atlas-Alignment, a framework that avoids this cost by aligning the latent space of a new model to a pre-existing, labeled Concept Atlas using only shared inputs and lightweight representational alignment methods. Through quantitative and qualitative evaluations, we show that simple alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept datasets. Atlas-Alignment thus amortizes the cost of…
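
The core idea in the abstract, mapping a new model's activations into an already labeled space using only shared inputs, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say which alignment methods are used, so ridge-regularized least squares stands in as one simple option, the data is synthetic, and names such as `fit_linear_alignment`, `retrieve_concepts`, and `concept_3` are hypothetical rather than taken from the paper.

```python
import numpy as np


def fit_linear_alignment(new_acts, atlas_acts, ridge=1e-3):
    """Fit W so that new_acts @ W approximates atlas_acts (ridge least squares).

    new_acts:   (n_inputs, d_new)   activations of the new model
    atlas_acts: (n_inputs, d_atlas) activations in the labeled atlas space,
                computed on the SAME inputs (the "shared inputs").
    """
    d_new = new_acts.shape[1]
    gram = new_acts.T @ new_acts + ridge * np.eye(d_new)
    return np.linalg.solve(gram, new_acts.T @ atlas_acts)


def retrieve_concepts(query_act, W, atlas_directions, labels, k=1):
    """Map one new-model activation into atlas space and return the
    k atlas labels whose directions have the highest cosine similarity."""
    mapped = query_act @ W
    mapped = mapped / np.linalg.norm(mapped)
    dirs = atlas_directions / np.linalg.norm(atlas_directions, axis=1, keepdims=True)
    scores = dirs @ mapped
    top = np.argsort(-scores)[:k]
    return [labels[i] for i in top]


# Synthetic stand-in for activations collected on shared inputs (assumption).
rng = np.random.default_rng(0)
n_inputs, d_new, d_atlas = 200, 8, 6
new_acts = rng.normal(size=(n_inputs, d_new))
true_W = rng.normal(size=(d_new, d_atlas))  # hidden "ground truth" relation
atlas_acts = new_acts @ true_W + 0.01 * rng.normal(size=(n_inputs, d_atlas))

W = fit_linear_alignment(new_acts, atlas_acts)

# Toy atlas: each atlas axis is one labeled concept (hypothetical labels).
atlas_directions = np.eye(d_atlas)
labels = [f"concept_{i}" for i in range(d_atlas)]

# Build a new-model activation whose atlas image is exactly concept 3.
query = atlas_directions[3] @ np.linalg.pinv(true_W)
result = retrieve_concepts(query, W, atlas_directions, labels, k=1)
print(result)
```

In this toy setup no concept labels are needed for the new model: the labels live only in the atlas, and the alignment map is fitted purely from paired activations. A steering variant would presumably run the map in the opposite direction, from an atlas direction back into the new model's space, but that is not shown here.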
