InversionView: A General-Purpose Method for Reading Information from Neural Activations
Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn

TL;DR
InversionView is a versatile method that decodes and visualizes the information encoded in neural activations, enhancing understanding of transformer models' inner workings.
Contribution
The paper introduces InversionView, a method for inspecting the subset of inputs that give rise to similar activations, which enables detailed analysis of the information content of activations in transformer models.
Findings
Reveals token information and positional data from activations
Decodes more complex information, such as the count of certain tokens
Provides causally verified circuits confirming decoded information
Abstract
The inner workings of neural networks can be better understood if we can fully decipher the information encoded in neural activations. In this paper, we argue that this information is embodied by the subset of inputs that give rise to similar activations. We propose InversionView, which allows us to practically inspect this subset by sampling from a trained decoder model conditioned on activations. This helps uncover the information content of activation vectors, and facilitates understanding of the algorithms implemented by transformer models. We present four case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we show that InversionView can reveal clear information contained in activations, including basic information about tokens appearing in the context, as well as more complex information, such as the count of certain tokens, their…
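The central idea in the abstract, that an activation's information is embodied by the set of inputs producing similar activations, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper samples this subset from a trained decoder conditioned on activations, whereas the sketch enumerates a tiny input space by brute force. The function `toy_activation`, the relative Euclidean distance, and the threshold `eps` are all assumptions made for illustration.

```python
import itertools

import numpy as np


def toy_activation(tokens, vocab_size=4):
    """Hypothetical stand-in for a model activation: normalized token counts.

    It deliberately discards token order, so order is not encoded.
    """
    counts = np.bincount(tokens, minlength=vocab_size).astype(float)
    return counts / len(tokens)


def similar_input_subset(query_tokens, candidates, activation_fn, eps=0.1):
    """Return the candidates whose activation is within eps of the query's.

    Distance is Euclidean, normalized by the query activation's norm
    (an assumed metric, chosen only for this sketch).
    """
    q = activation_fn(query_tokens)
    q_norm = np.linalg.norm(q)
    subset = []
    for cand in candidates:
        a = activation_fn(cand)
        if np.linalg.norm(a - q) / q_norm <= eps:
            subset.append(cand)
    return subset


# Enumerate every length-3 sequence over a 4-token vocabulary.
vocab_size, seq_len = 4, 3
candidates = [tuple(c) for c in itertools.product(range(vocab_size), repeat=seq_len)]

query = (0, 0, 1)
subset = similar_input_subset(query, candidates, toy_activation, eps=0.1)
print(subset)
```

Every input in the returned subset shares the query's token counts but not its token order. What the subset members have in common is read as what the activation encodes; what varies across them is read as what it does not encode.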
Taxonomy
Topics: Neural Networks and Applications
Methods: Cosine Annealing · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Adam · Attention Is All You Need
