InverseScope: Scalable Activation Inversion for Interpreting Large Language Models

Yifan Luo; Zhennan Zhou; Bin Dong

arXiv:2506.07406·cs.LG·September 30, 2025

Open Access · 3 Reviews

TL;DR

InverseScope introduces a scalable, assumption-light input inversion framework for interpreting large language models' internal activations, enabling systematic analysis of their representations in practical settings.

Contribution

It proposes a novel, efficient input inversion method with a new generation architecture and evaluation protocol, advancing interpretability of large models.

Findings

01. Improved sample efficiency over previous methods

02. Scalable analysis of large language models

03. Quantitative evaluation of interpretability hypotheses

Abstract

Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, we define a distribution over inputs that generate similar activations and analyze this distribution to infer the encoded information. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture that significantly improves sample efficiency compared to previous methods. We further introduce a quantitative evaluation protocol that tests interpretability hypotheses using the feature consistency…
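The idea of a distribution over inputs that produce similar activations can be illustrated with a small sketch. This is not the paper's implementation: the abstract describes a trained conditional generator, whereas the code below simply reweights a fixed candidate pool. The Gaussian kernel, the `width` parameter, the toy two-dimensional activations, the example sentences, and the function names `kernel_weights` and `consistency_rate` are all assumptions made for illustration.

```python
import math

def kernel_weights(target, candidate_acts, width):
    """Normalized Gaussian-kernel weights; closer activations get more mass.

    The Gaussian form and `width` are illustrative assumptions.
    """
    sq_dists = [sum((a - t) ** 2 for a, t in zip(act, target))
                for act in candidate_acts]
    nearest = min(sq_dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - nearest) / (2.0 * width ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def consistency_rate(weights, texts, target_text, feature):
    """Probability mass on inputs whose feature matches the target's."""
    wanted = feature(target_text)
    return sum(w for w, text in zip(weights, texts) if feature(text) == wanted)

# Hypothetical pool: (input text, toy activation). Values are made up so that
# the first coordinate tracks the sentence subject.
pool = [
    ("Mary gave the book to John", [1.0, 0.1]),
    ("Mary handed the key to Tom", [0.9, 0.2]),
    ("John gave the book to Mary", [-1.0, 0.1]),
    ("Tom passed the pen to Mary", [-0.8, -0.1]),
]
texts = [text for text, _ in pool]
acts = [act for _, act in pool]
target_text, target_act = pool[0]

weights = kernel_weights(target_act, acts, width=0.5)

# Two candidate hypotheses about what the activation encodes.
subject_rate = consistency_rate(weights, texts, target_text,
                                lambda s: s.split()[0])
recipient_rate = consistency_rate(weights, texts, target_text,
                                  lambda s: s.split()[-1])
print(f"subject consistency:   {subject_rate:.3f}")
print(f"recipient consistency: {recipient_rate:.3f}")
```

In this toy setup nearly all of the weight falls on inputs that share the target's subject, while only about half falls on inputs that share its recipient, so the subject hypothesis is the better supported of the two. Shrinking or growing `width` changes how sharply the weights concentrate around the target.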

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper tackles a somewhat under-investigated question in the interpretability literature overall: can we have an "oracle" that simply tells us what features of the input text a given activation "cares about", without relying on assumptions like linearity or sparse coding? The work makes the assumption that activation semantics is "continuous", which seems reasonable as far as assumptions go. The writing is clear & easy to follow.

Weaknesses

- The contribution over the prior work [1] is relatively incremental. The prior work also trains an activation inverter. The main contribution of the current work is not in methodology, but in the kernel used for approximate inversion and in the experiments.
- The approximate activation inversion process complicates and obfuscates the method, as it introduces hyperparameters (the noise scale and the "width" of the kernel) with an unknown role in the final results. Additionally, the inversion onl

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- Proposes a novel framework for interpreting LLM internals
- Operationalizes the framework with a conditional generation architecture
- Results on IOI and RAVEL show the method is promising and more accurate than SAE-based alternatives
- Interesting analysis sheds light on where task-specific features from ICL are encoded within the model

Weaknesses

- It would be interesting to also show qualitative samples of reconstructed inputs, as well as failure cases.
- I think it is too strong to consider a conditional LM as a standalone contribution, as such architectures have been widely used prior.
- The related works section feels quite thin. Namely, conditional generation based on model latents, and the connections to the perspective of variational autoencoders seem relevant - but this is quite minor.

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The proposed method is more sample efficient than other inversion methods. The proposed method correctly identifies some of the attention heads that are important for the IOI circuit in GPT2. The accuracy on the classification task on the RAVEL benchmark surpasses that of SAEs.

Weaknesses

The 'inverting' model has to be re-trained for each specific task. Although this can be said of several interpretability methods, it is not clear here if the latent representations have any causal link to the mechanisms of the underlying model and it is not obvious how to test the predictions made. On the IOI task, the 'ground' truth heads were correctly identified by the consistency rate, but there were also several other heads that had similar consistency. In a world where 'ground' truth la

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications