Activation Transport Operators
Andrzej Szablewski, Marek Masiak

TL;DR
This paper introduces Activation Transport Operators (ATO), linear maps that analyze how features are linearly transported through residual streams in transformer models, aiding understanding, safety, and debugging of LLMs.
Contribution
The paper proposes ATO, a novel linear operator framework to measure feature transport in residual streams, providing insights into linearity and efficiency in transformer models.
Findings
ATO can identify linearly transported features
The paper establishes an upper bound on transport efficiency
Empirical results show significant linear transport in residuals
Abstract
The residual stream mediates communication between transformer decoder layers via linear reads and writes of non-linear computations. While sparse-dictionary learning-based methods locate features in the residual stream, and activation patching methods discover circuits within the model, the mechanism by which features flow through the residual stream remains understudied. Understanding this dynamic can better inform jailbreaking protections and enable early detection and correction of model mistakes. In this work, we propose Activation Transport Operators (ATO), linear maps from upstream to downstream residuals k layers later, evaluated in feature space using downstream SAE decoder projections. We empirically demonstrate that these operators can determine whether a feature has been linearly transported from a previous layer or synthesised from non-linear layer computation. We…
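To make the idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It fits a linear map from synthetic "upstream" activations to synthetic "downstream" activations, then scores the fit along individual feature directions. Everything specific here is an assumption made for illustration: the ridge-regularised least-squares fit, the random orthonormal vectors standing in for downstream SAE decoder directions, the toy data, and the use of held-out R² as the per-feature score.

```python
import numpy as np

def fit_transport_operator(X_up, Y_down, ridge=1e-3):
    """Ridge least-squares fit of T so that X_up @ T approximates Y_down.
    (Hypothetical helper; the paper's fitting procedure may differ.)"""
    d = X_up.shape[1]
    gram = X_up.T @ X_up + ridge * np.eye(d)
    return np.linalg.solve(gram, X_up.T @ Y_down)

def feature_r2(X_up, Y_down, T, decoder_dir):
    """R^2 between actual and linearly predicted downstream activations,
    both projected onto a single feature direction."""
    actual = Y_down @ decoder_dir
    predicted = (X_up @ T) @ decoder_dir
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d = 4000, 16
X = rng.normal(size=(n, d))  # stand-in for upstream residuals

# Two orthonormal directions standing in for SAE decoder vectors (assumption).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
dir_linear, dir_synth = Q[:, 0], Q[:, 1]

# One feature is a linear function of the upstream residual ("transported");
# the other is a product of two coordinates ("synthesised" non-linearly).
w = rng.normal(size=d)
linear_act = X @ w
synth_act = X[:, 0] * X[:, 1]
Y = (np.outer(linear_act, dir_linear)
     + np.outer(synth_act, dir_synth)
     + 0.05 * rng.normal(size=(n, d)))  # stand-in for downstream residuals

# Fit on the first half, evaluate on the held-out second half.
half = n // 2
T = fit_transport_operator(X[:half], Y[:half])
r2_linear = feature_r2(X[half:], Y[half:], T, dir_linear)
r2_synth = feature_r2(X[half:], Y[half:], T, dir_synth)

print(f"R^2 along linearly transported direction: {r2_linear:.3f}")
print(f"R^2 along non-linearly synthesised direction: {r2_synth:.3f}")
```

In this toy setting the linear map explains the first direction almost completely and the second hardly at all, which is the kind of per-feature contrast the abstract describes. With a real model, the inputs would be residual activations collected at two layers and the directions would come from a trained SAE decoder.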
