TL;DR
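WriteSAE is a sparse autoencoder that learns rank-1 matrix atoms with the same shape as a recurrent model's state writes. Substituting an atom for the model's own write gives a closer final token distribution than deleting the write at 92.4% of evaluated positions, and the atoms enable cache-level steering in language models.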
Contribution
The paper presents WriteSAE, a sparse autoencoder whose rank-1 matrix atoms can directly replace the matrix updates that recurrent language models (Gated DeltaNet, Mamba-2, RWKV-7) write to their cache, enabling improved interpretability and control.
Findings
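Substituted atoms give a closer final token distribution than deleting the write at 92.4% of evaluated positions (89.8% when averaged per atom)
For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the logit change with R^2=0.98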
Generation steering increases token appearance in continuations from 33.3% to 100%
Abstract
We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with R^2=0.98. The same replacement test transfers to…
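The replacement test described in the abstract can be illustrated with a toy recurrence. The sketch below is not the paper's implementation: it assumes a simplified gated linear state update S_t = g_t * S_{t-1} + W_t with W_t = outer(k_t, v_t), which omits the delta-rule term of Gated DeltaNet, and it stands in for the SAE activation with a least-squares projection of the write onto a unit-norm atom. The names run_state, atom, and activation are hypothetical.

```python
import numpy as np

def run_state(writes, gates, replace_at=None, replacement=None):
    """Roll a toy gated recurrence S_t = g_t * S_{t-1} + W_t.

    If replace_at is given, the write at that position is swapped for
    `replacement` (an atom, or zeros to delete the write).
    """
    S = np.zeros_like(writes[0])
    for t, (W, g) in enumerate(zip(writes, gates)):
        if t == replace_at:
            W = replacement
        S = g * S + W
    return S

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 8, 6
keys = rng.normal(size=(T, d_k))
vals = rng.normal(size=(T, d_v))
writes = [np.outer(k, v) for k, v in zip(keys, vals)]  # matrix-shaped writes
gates = np.full(T, 0.9)                                # assumed scalar forget gate

t0 = 2  # position where the (hypothetical) SAE activates an atom

# Hypothetical rank-1 atom: a noisy, unit-norm copy of the write's directions.
u = keys[t0] + 0.1 * rng.normal(size=d_k)
v = vals[t0] + 0.1 * rng.normal(size=d_v)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)
atom = np.outer(u, v)  # same shape as the model's write, Frobenius norm 1

# Stand-in for the SAE activation: least-squares coefficient of the write on the atom.
activation = np.sum(atom * writes[t0])

S_full = run_state(writes, gates)                                     # original writes
S_atom = run_state(writes, gates, t0, activation * atom)              # atom substituted
S_ablate = run_state(writes, gates, t0, np.zeros((d_k, d_v)))         # write deleted

err_state_atom = np.linalg.norm(S_atom - S_full)
err_state_ablate = np.linalg.norm(S_ablate - S_full)

# Effect on one read of the final state with an arbitrary query vector.
q = rng.normal(size=d_k)
err_read_atom = np.linalg.norm(q @ S_atom - q @ S_full)
err_read_ablate = np.linalg.norm(q @ S_ablate - q @ S_full)

print("state error, atom vs. ablate:", err_state_atom, err_state_ablate)
print("read error,  atom vs. ablate:", err_read_atom, err_read_ablate)
```

In this toy setting the projected atom can never leave the final state further from the original than deletion does, since the projection residual is at most the norm of the write itself. The paper's comparison is made on the final token distribution of a real model, which this sketch does not reproduce.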
