Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
Gabriel Smithline, Chris Mascioli

TL;DR
This paper investigates how the choice of feedforward network (FFN) architecture in small Transformers changes the division of computation between the FFN and attention. Sparse routing can shift computation toward attention, and gating changes how interpretable the learned structure is.
Contribution
The paper demonstrates that architectural sparsity in the FFN redistributes computation within one-layer Transformers, shows that this redistribution is driven largely by the sparse architecture itself rather than by learned routing, and explores the implications for interpretability.
Findings
Sparse MoE routing can shift computation from FFN to attention, most visibly under ablation on carry-based addition.
Frozen random routing nearly matches learned routing effects.
GLU gating redistributes Fourier structure, affecting interpretability.
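To make the routing findings concrete, here is a minimal NumPy sketch of a sparse MoE FFN forward pass. It is an illustration under stated assumptions, not the authors' implementation: top-1 routing, ReLU experts, a total hidden width split evenly across experts, and all names (`init_moe`, `moe_ffn`, `d_hidden_total`) are hypothetical.

```python
import numpy as np

def init_moe(d_model, d_hidden_total, n_experts, seed=0):
    """Random MoE parameters. The hidden width is split evenly across
    experts (an assumption), so each token sees only d_hidden_total / n_experts units."""
    rng = np.random.default_rng(seed)
    d_expert = d_hidden_total // n_experts
    return {
        # Router: one logit per expert. "Frozen random routing" would mean
        # leaving this matrix at its initial value during training.
        "router": rng.normal(size=(d_model, n_experts)),
        "w_in": rng.normal(size=(n_experts, d_model, d_expert)) / np.sqrt(d_model),
        "w_out": rng.normal(size=(n_experts, d_expert, d_model)) / np.sqrt(d_expert),
    }

def moe_ffn(params, x):
    """x: (n_tokens, d_model). Returns (output, chosen expert per token)."""
    logits = x @ params["router"]          # (n_tokens, n_experts)
    choice = logits.argmax(axis=-1)        # top-1 routing (assumption)
    out = np.zeros_like(x)
    for e in range(params["w_in"].shape[0]):
        mask = choice == e
        if mask.any():
            hidden = np.maximum(x[mask] @ params["w_in"][e], 0.0)  # ReLU expert
            out[mask] = hidden @ params["w_out"][e]
    return out, choice

params = init_moe(d_model=16, d_hidden_total=64, n_experts=4)
x = np.random.default_rng(1).normal(size=(8, 16))
y, choice = moe_ffn(params, x)
```

In this sketch each token passes through only 16 of the 64 hidden units, which corresponds to the reduced per-token FFN capacity described in the abstract. Comparing learned and frozen routing would amount to either updating `params["router"]` during training or excluding it from the optimizer.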
Abstract
Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates…
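For readers unfamiliar with the dense and gated variants compared above, the following NumPy sketch contrasts a dense FFN with a GLU. It is a generic illustration, not the paper's exact configuration: the ReLU and sigmoid activations, the absence of biases, and all function and variable names are assumptions.

```python
import numpy as np

def dense_ffn(x, w_in, w_out):
    """Dense FFN: project up, apply a pointwise nonlinearity (ReLU here), project down."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def glu_ffn(x, w_gate, w_up, w_out):
    """GLU: the up-projection is multiplied elementwise by a sigmoid gate
    computed from the same input, so the hidden layer is multiplicative in x."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))
    return (gate * (x @ w_up)) @ w_out

rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 16, 32, 8
x = rng.normal(size=(n_tokens, d_model))
w_in = rng.normal(size=(d_model, d_hidden))
w_gate = rng.normal(size=(d_model, d_hidden))
w_up = rng.normal(size=(d_model, d_hidden))
w_out = rng.normal(size=(d_hidden, d_model))

dense_out = dense_ffn(x, w_in, w_out)
glu_out = glu_ffn(x, w_gate, w_up, w_out)
```

The structural difference is the elementwise product in `glu_ffn`: the dense block applies a fixed nonlinearity to one projection, while the GLU multiplies two projections of the same input.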
