Attention-Only Transformers and Implementing MLPs with Attention Heads
Robert Huben, Valerie Morris

TL;DR
This paper proves that MLP neurons can be implemented by masked attention heads, so an MLP-and-attention transformer can be converted into an attention-only transformer at the cost of greatly increasing the number of attention heads. It also shows that attention heads can perform the individual components of an MLP and can encode masking patterns in their weight matrices.
Contribution
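The paper proves that an MLP neuron can be realized by a masked attention head, that attention heads can separately perform the linear transformations and activation functions of an MLP, and that attention heads can encode masking patterns in their weight matrices. Together these results allow an attention-only transformer.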
Findings
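An MLP neuron can be implemented by a masked attention head with internal dimension 1, provided the activation function comes from a restricted class that includes SiLU and close approximations of ReLU and GeLU.
Attention heads can separately perform the linear transformations and activation functions of an MLP.
Attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.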
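To illustrate why a masking pattern can be approximated to within arbitrarily small error, here is a minimal sketch that compares an exact mask with a soft one built by subtracting a large penalty from the disallowed attention scores. The additive-penalty approach, the function names, and the penalty parameter `omega` are illustrative assumptions; the paper encodes the pattern in the head's weight matrices, and this sketch does not reproduce that construction.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_masked(scores, allowed):
    # Exact masking: softmax over allowed positions only, zero elsewhere.
    kept = iter(softmax([s for s, a in zip(scores, allowed) if a]))
    return [next(kept) if a else 0.0 for a in allowed]

def penalty_masked(scores, allowed, omega):
    # Approximate masking (assumed technique): subtract a large penalty
    # omega from every disallowed score before the softmax.
    return softmax([s if a else s - omega for s, a in zip(scores, allowed)])

scores = [0.3, -1.2, 2.0, 0.7]          # arbitrary example scores
allowed = [True, False, True, False]    # arbitrary example mask

exact = hard_masked(scores, allowed)
errors = {
    omega: max(abs(p - q)
               for p, q in zip(exact, penalty_masked(scores, allowed, omega)))
    for omega in (5.0, 20.0, 50.0)
}
```

In this sketch the largest deviation from the exact mask shrinks as `omega` grows, which is the sense in which the error can be made as small as desired.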
Abstract
The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
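The link between a dimension-1 attention head and SiLU can be checked numerically. The sketch below is a simplified illustration and not the paper's full construction. It assumes a head whose mask lets a token attend only to itself and to one reference position, and that the reference position has score 0 and value 0. The softmax weight on the token is then sigmoid(z), so the head outputs z * sigmoid(z) = SiLU(z). The function names and the reference-position setup are hypothetical.

```python
import math

def silu(z):
    # Reference SiLU: z * sigmoid(z).
    return z / (1.0 + math.exp(-z))

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_dim_head(z):
    # Scalar (dimension-1) attention over two visible positions.
    # Position 0: assumed reference token with score 0 and value 0.
    # Position 1: the token itself with score z and value z.
    weights = softmax([0.0, z])
    values = [0.0, z]
    return sum(w * v for w, v in zip(weights, values))

samples = [-3.0, -0.5, 0.0, 0.5, 3.0]
max_err = max(abs(one_dim_head(z) - silu(z)) for z in samples)
```

Here `z` stands for the neuron's pre-activation, which in a real head would come from the query, key, and value projections of the residual stream. One head of this kind covers one neuron, which is consistent with the abstract's remark that the conversion greatly increases the number of attention heads.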
Taxonomy
Topics: Neural Networks and Applications · Advanced Neural Network Applications · Machine Learning and ELM
Methods: Sigmoid Linear Unit
