Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
Valeria Ruscio, Eli-Shaoul Khedouri, Keiran Thompson

TL;DR
This paper investigates the asymmetric effects of pretraining and alignment on transformer weights, revealing distinct geometric traces and their underlying causes through empirical and theoretical analysis.
Contribution
It characterizes the geometric asymmetry in transformer weights caused by pretraining and alignment, explaining it via anisotropic gradient accumulation and providing causal evidence.
Findings
Alignment updates concentrate in the read pathway ($W_Q$, $W_K$).
Pretraining induces prediction geometry in the write pathway ($W_O$, $W_2$).
Gradient anisotropy explains the observed weight-space patterns.
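As a rough sketch of why gradient anisotropy can shape weight deltas, consider a single linear map $y = Wx$ with loss $\mathcal{L}$. The symbols $x_t$, $\delta_t$, $u$, and $\Sigma_x$ are notation assumed here for illustration and are not taken from the paper. The gradient with respect to $W$ is a sum of outer products, and each term's action on a unit input direction $u$ factorises:

$$
\nabla_W \mathcal{L} = \sum_t \delta_t x_t^\top, \qquad \delta_t = \frac{\partial \mathcal{L}}{\partial y_t}, \qquad \left\lVert \delta_t x_t^\top u \right\rVert^2 = \lVert \delta_t \rVert^2 \, (x_t^\top u)^2 .
$$

Under the simplifying assumption that $\lVert \delta_t \rVert$ is independent of $x_t$ and that $x_t$ has zero mean, the expected per-sample contribution along $u$ is proportional to $u^\top \Sigma_x u$, where $\Sigma_x$ is the input covariance. Directions in which $\Sigma_x$ is large would then tend to receive larger updates on the input side of $W$.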
Abstract
Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix are sums of outer products and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation, whose covariance is spiked in trained…
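The probe described in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the function names `top_k_basis` and `relative_subspace_fraction`, the normalisation by the isotropic baseline $k / d_{\text{in}}$, and the synthetic data are all assumptions made for illustration. The sketch measures how much of a weight delta's squared Frobenius norm falls, on the input side, inside the top-$k$ principal subspace of the input activations, so a value near 1 means near-isotropic and a larger value means concentration.

```python
import numpy as np

def top_k_basis(X, k):
    """Top-k principal directions of activations X, shape (n_samples, d_in)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T  # (d_in, k), orthonormal columns

def relative_subspace_fraction(delta_W, basis):
    """Fraction of ||delta_W||_F^2 inside span(basis), divided by k / d_in.

    Assumed convention: delta_W has shape (d_out, d_in) and acts as y = W x,
    so the subspace lives on the input (column) side of the matrix.
    """
    d_in, k = basis.shape
    inside = np.linalg.norm(delta_W @ basis) ** 2  # Frobenius norm by default
    total = np.linalg.norm(delta_W) ** 2
    return (inside / total) / (k / d_in)

rng = np.random.default_rng(0)
n, d_in, d_out, k = 2000, 64, 32, 4

# Synthetic inputs with a spiked covariance: k directions have 10x larger scale.
scales = np.ones(d_in)
scales[:k] = 10.0
X = rng.normal(size=(n, d_in)) * scales
basis = top_k_basis(X, k)

# A delta built as an average of outer products delta_t x_t^T (toy gradients).
deltas = rng.normal(size=(n, d_out))
delta_spiked = deltas.T @ X / n

# An isotropic reference delta with no relation to the inputs.
delta_iso = rng.normal(size=(d_out, d_in))

r_spiked = relative_subspace_fraction(delta_spiked, basis)
r_iso = relative_subspace_fraction(delta_iso, basis)
print(f"spiked-input delta: {r_spiked:.2f}, isotropic delta: {r_iso:.2f}")
```

In this toy setting the outer-product delta inherits the spike in the input covariance, while the random delta stays near the isotropic baseline. The same function could be pointed at a different basis (for example, one derived from an unembedding matrix) to probe the output side, though how the paper defines that subspace is not specified in the text above.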
