Equivalence of Context and Parameter Updates in Modern Transformer Blocks
Adrian Goldwaser, Michael Munn, Javier Gonzalvo, Benoit Dherin

TL;DR
This paper demonstrates that in modern transformer architectures, the effects of context can be exactly represented by implicit rank-1 patches to MLP weights, unifying various models under a common theoretical framework.
Contribution
The paper extends the foundational theory for vanilla transformers to modern LLM architectures, giving a constructive proof for multi-layer models and a general framework for understanding implicit weight updates.
Findings
Exact mapping of context effects to rank-1 weight patches in transformer blocks
General framework applicable to diverse LLM architectures
Theoretical proof of perfect implicit weight patches under controllability conditions
Abstract
Recent research has established that the impact of context in a vanilla transformer can be represented implicitly by forming a token-dependent, rank-1 patch to its MLP weights. This work extends that foundational theory to the diverse architectures of modern Large Language Models. We first demonstrate a precise, analytical solution for a Gemma-style transformer block, proving that the entire effect of a context can be perfectly mapped to rank-1 patches on its MLP weight matrices and a patch to the RMSNorm scale. We then generalize this result, providing a constructive proof and algorithm for multi-layer models. To unify these findings, we introduce a general framework centered on two core properties: input controllability and output controllability. We prove that a perfect implicit weight patch is possible for any MLP block where the inner function is input-controllable and the outer…
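The rank-1 idea in the abstract can be illustrated with a minimal NumPy sketch of the simpler vanilla case, not the Gemma-style construction the paper derives. Everything here is an assumption for illustration: `W` stands in for the first MLP weight matrix, `a_no_ctx` and `a_ctx` are random stand-ins for the attention output at the query token without and with the context, and `rank1_patch` is a hypothetical helper name. The patch is chosen so that applying the patched weights to the context-free input reproduces what the original weights give on the contextual input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16

# Hypothetical stand-ins: first MLP weight matrix and attention outputs
# for the query token without and with the context.
W = rng.normal(size=(d_hidden, d_model))
a_no_ctx = rng.normal(size=d_model)
a_ctx = rng.normal(size=d_model)


def rank1_patch(W, a_no_ctx, a_ctx):
    """Rank-1 update dW such that (W + dW) @ a_no_ctx == W @ a_ctx.

    Assumes a_no_ctx is nonzero. This is the vanilla-case construction,
    used here only as an illustration.
    """
    delta_a = a_ctx - a_no_ctx
    # Outer product of (W @ delta_a) with a_no_ctx, scaled by 1 / ||a_no_ctx||^2.
    return np.outer(W @ delta_a, a_no_ctx) / (a_no_ctx @ a_no_ctx)


delta_W = rank1_patch(W, a_no_ctx, a_ctx)

with_context = W @ a_ctx               # original weights, contextual input
patched = (W + delta_W) @ a_no_ctx     # patched weights, context-free input

print(np.allclose(with_context, patched))
print(np.linalg.matrix_rank(delta_W))
```

The patch depends on the token through `a_no_ctx` and `a_ctx`, which is why the abstract describes it as token-dependent. The sketch does not model the RMSNorm scale patch or the multi-layer algorithm.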
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Graph Neural Networks · Parallel Computing and Optimization Techniques
