InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
Yifan Luo, Zhennan Zhou, Bin Dong

TL;DR
InverseScope introduces a scalable, assumption-light input inversion framework for interpreting large language models' internal activations, enabling systematic analysis of their representations in practical settings.
Contribution
It proposes a novel, efficient input inversion method with a new generation architecture and evaluation protocol, advancing interpretability of large models.
Findings
Improved sample efficiency over previous methods
Scalable analysis of large language models
Quantitative evaluation of interpretability hypotheses
Abstract
Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, we define a distribution over inputs that generate similar activations and analyze this distribution to infer the encoded information. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture that significantly improves sample efficiency compared to previous methods. We further introduce a quantitative evaluation protocol that tests interpretability hypotheses using the feature consistency…
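The inversion idea in the abstract can be illustrated with a small toy sketch. Everything here is an assumption made for illustration: the "model" is a mean of one-hot token embeddings rather than an LLM layer, the similarity kernel is Gaussian, and candidates are drawn from a uniform prior and reweighted. That naive sampling scheme is not the paper's conditional generation architecture; it is the kind of baseline the abstract describes as inefficient in high dimensions. The function names (`kernel_weights`, `sample_inversion`, `consistency_rate`) are hypothetical, and the consistency score is one plausible reading of the evaluation idea, not the authors' exact definition.

```python
import numpy as np

def kernel_weights(acts, target, width):
    """Normalized Gaussian kernel weights on activation distance (assumed kernel)."""
    sq_dist = ((acts - target) ** 2).sum(axis=1)
    w = np.exp(-sq_dist / (2.0 * width ** 2))
    return w / w.sum()

def sample_inversion(reference, embed, n_samples, width, rng):
    """Draw inputs from a uniform prior and weight them by activation similarity."""
    vocab_size = embed.shape[0]
    # Toy activation: mean of the reference input's token embeddings.
    target = embed[reference].mean(axis=0)
    samples = rng.integers(0, vocab_size, size=(n_samples, len(reference)))
    # Toy activations of all candidates, shape (n_samples, dim).
    acts = embed[samples].mean(axis=1)
    return samples, kernel_weights(acts, target, width)

def consistency_rate(samples, weights, feature, reference):
    """Weighted fraction of samples whose feature value matches the reference's."""
    ref_value = feature(reference)
    matches = np.array([feature(s) == ref_value for s in samples], dtype=float)
    return float((weights * matches).sum())

rng = np.random.default_rng(0)
embed = np.eye(5)                 # hypothetical 5-token vocabulary, one-hot embeddings
reference = np.array([0, 3])      # input whose activation we invert
samples, weights = sample_inversion(reference, embed, 2000, 0.1, rng)

# Two candidate hypotheses about what the activation encodes.
token_set = lambda s: tuple(sorted(int(t) for t in s))  # which tokens appear
first_token = lambda s: int(s[0])                       # which token comes first

set_rate = consistency_rate(samples, weights, token_set, reference)
first_rate = consistency_rate(samples, weights, first_token, reference)
print(f"token-set consistency: {set_rate:.3f}")
print(f"first-token consistency: {first_rate:.3f}")
```

Because the toy activation is a mean over positions, it keeps which tokens appear but discards their order. The token-set hypothesis should therefore score close to 1, while the first-token hypothesis should score noticeably lower. Most uniformly drawn candidates receive negligible weight, which is the sampling inefficiency that motivates a learned conditional generator.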
Peer Reviews
Decision·Submitted to ICLR 2026
The paper tackles a somewhat under-investigated question in the interpretability literature: can we have an "oracle" that simply tells us what features of the input text a given activation "cares about", without relying on assumptions like linearity or sparse coding? The work assumes that activation semantics are "continuous", which seems reasonable as far as assumptions go. The writing is clear and easy to follow.
- The contribution over the prior work [1] is relatively incremental. The prior work also trains an activation inverter. The main contribution of the current work is not in methodology, but in the kernel used for approximate inversion and in the experiments.
- The approximate activation inversion process complicates and obfuscates the method, as it introduces hyperparameters (the noise scale and the "width" of the kernel) with an unknown role in the final results. Additionally, the inversion onl
- Proposes a novel framework for interpreting LLM internals
- Operationalizes the framework with a conditional generation architecture
- Results on IOI and RAVEL show the method is promising and more accurate than SAE-based alternatives
- Interesting analysis sheds light on where task-specific features from ICL are encoded within the model
- It would be interesting to also show qualitative samples of reconstructed inputs, as well as failure cases.
- I think it is too strong to consider a conditional LM as a standalone contribution, as such architectures have been widely used in prior work.
- The related work section feels quite thin. Namely, conditional generation based on model latents and the connection to the variational autoencoder perspective seem relevant, but this is quite minor.
The proposed method is more sample efficient than other inversion methods. The proposed method correctly identifies some of the attention heads that are important for the IOI circuit in GPT2. The accuracy on the classification task on the RAVEL benchmark surpasses that of SAEs.
The 'inverting' model has to be re-trained for each specific task. Although this can be said of several interpretability methods, it is not clear here whether the latent representations have any causal link to the mechanisms of the underlying model, and it is not obvious how to test the predictions made. On the IOI task, the 'ground' truth heads were correctly identified by the consistency rate, but several other heads had similar consistency. In a world where 'ground' truth la
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
