TL;DR
This paper introduces a method that maps LLM inference on a given input sequence to an equivalent linear system, revealing the model's low-dimensional semantic structure and enabling interpretability without additional training.
Contribution
It presents a novel approach to map LLM inference to an interpretable linear system using detached Jacobians, exposing the models' low-dimensional semantic subspaces.
Findings
Linear mappings reconstruct outputs with near-zero error
LLMs operate in low-dimensional semantic subspaces
Linear representations help interpret and steer model predictions
Abstract
Despite significant progress in transformer interpretability, an understanding of the computational mechanisms of large language models (LLMs) remains a fundamental challenge. Many approaches interpret a network's hidden representations but remain agnostic about how those representations are generated. We address this by mapping LLM inference for a given input sequence to an equivalent and interpretable linear system which reconstructs the predicted output embedding with near-zero relative error at double floating-point precision, requiring no additional model training. We exploit a property of transformers wherein every operation (gated activations, attention, and normalization) can be expressed as $A(x)\,x$, where $A(x)$ represents an input-dependent linear transform and $x$ preserves the linear pathway. To expose this linear structure, we strategically detach components…
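As an illustration of the property the abstract describes, the sketch below builds a toy layer (RMS normalization followed by a SiLU-gated MLP) and writes it as an input-dependent matrix applied to the input. It is not the paper's implementation: the weights `GAIN`, `W_IN`, and `W_OUT`, the helper names, and the symbol A(x) for the input-dependent transform are all assumptions made for this example, attention is omitted, and the matrix is assembled by hand rather than by detaching terms in an autodiff graph. The idea is that the nonlinear factors (the RMS scale and the sigmoid gate) are evaluated at the given input and then frozen, leaving a purely linear map that reproduces the forward output at that input.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def diag(d):
    """Build a square diagonal matrix from the list d."""
    n = len(d)
    return [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rms(x, eps=1e-6):
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

# Hypothetical toy weights, chosen only for illustration.
GAIN = [1.0, 0.5, 2.0]
W_IN = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5], [-0.6, 0.8, 0.9], [0.05, -0.2, 0.4]]
W_OUT = [[0.3, -0.1, 0.5, 0.2], [-0.7, 0.6, 0.1, -0.3], [0.4, 0.2, -0.8, 0.9]]

def forward(x):
    """Ordinary nonlinear forward pass: RMSNorm -> linear -> SiLU -> linear."""
    scale = rms(x)
    normed = [g * v / scale for g, v in zip(GAIN, x)]
    h = matvec(W_IN, normed)
    act = [v * sigmoid(v) for v in h]  # SiLU(h) = h * sigmoid(h)
    return matvec(W_OUT, act)

def detached_linear_map(x):
    """Return A(x): the layer with its nonlinear factors frozen at x."""
    # RMSNorm becomes a diagonal matrix once 1/rms(x) is treated as a constant.
    norm_map = diag([g / rms(x) for g in GAIN])
    h = matvec(W_IN, matvec(norm_map, x))
    # SiLU becomes a diagonal matrix once the sigmoid gate is treated as a constant.
    gate_map = diag([sigmoid(v) for v in h])
    return matmul(W_OUT, matmul(gate_map, matmul(W_IN, norm_map)))

x = [0.5, -1.2, 2.0]
y = forward(x)
A = detached_linear_map(x)
y_lin = matvec(A, x)  # linear reconstruction of the same output

diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_lin)))
rel_err = diff / math.sqrt(sum(v * v for v in y))
print("relative reconstruction error:", rel_err)
```

The matrix returned by `detached_linear_map` is valid only for the input it was built from; a different input yields a different matrix, which is what "input-dependent" means here. In an autodiff framework the same matrix would typically be obtained as the Jacobian of the forward pass with the frozen factors detached from the gradient graph.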
Taxonomy
Methods: Softmax · Attention Is All You Need · Diffusion · LLaMA
