Attention-Only Transformers and Implementing MLPs with Attention Heads

Robert Huben; Valerie Morris

arXiv:2309.08593 · cs.LG · September 18, 2023

Open Access

TL;DR

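This paper proves that an MLP neuron can be implemented by a masked attention head, so an MLP-and-attention transformer can be converted into an attention-only transformer at the cost of greatly increasing the number of attention heads. It also shows that attention heads can separately perform an MLP's linear transformations and activation functions, and can encode masking patterns in their weight matrices.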

Contribution

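It proves three results: an MLP neuron can be realized by a masked attention head with internal dimension 1 when the activation function comes from a restricted class (including SiLU and close approximations of ReLU and GeLU); attention heads can separately perform an MLP's linear transformations and activation functions; and attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error. Together these make attention-only transformers possible.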

Findings

01

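An MLP neuron can be implemented by a masked attention head with internal dimension 1, provided the activation function belongs to a restricted class that includes SiLU and close approximations of ReLU and GeLU.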

02

Attention heads can separately perform the two components of an MLP: linear transformations and activation functions.

03

Attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
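
The masking finding can be illustrated with a small numerical sketch. It is not the paper's construction: it assumes each token carries a one-hot position code, and the names `weight_encoded_attention` and `omega` are hypothetical. The idea is that a block of the query-key matrix adds a penalty of `-omega` to every disallowed pair, so an unmasked softmax approaches the explicitly masked one as `omega` grows.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def masked_attention(X, W_qk, mask):
    # Reference: explicit mask, disallowed scores set to -inf before softmax.
    scores = X @ W_qk @ X.T
    return softmax(np.where(mask, scores, -np.inf))

def weight_encoded_attention(X, W_qk, mask, omega):
    # Assumption of this sketch: each token is augmented with a one-hot
    # position code, so the mask can live inside the query-key matrix.
    n, d = X.shape
    X_aug = np.hstack([X, np.eye(n)])
    W_aug = np.zeros((d + n, d + n))
    W_aug[:d, :d] = W_qk
    # Allowed pairs get 0, disallowed pairs get -omega.
    W_aug[d:, d:] = omega * (mask.astype(float) - 1.0)
    return softmax(X_aug @ W_aug @ X_aug.T)

rng = np.random.default_rng(0)
X = 0.5 * rng.standard_normal((5, 4))
W_qk = rng.standard_normal((4, 4))
causal = np.tril(np.ones((5, 5), dtype=bool))  # every row allows at least one position

exact = masked_attention(X, W_qk, causal)
err_small = np.abs(weight_encoded_attention(X, W_qk, causal, 1.0) - exact).max()
err_large = np.abs(weight_encoded_attention(X, W_qk, causal, 1000.0) - exact).max()
```

Here `err_large` should be far smaller than `err_small`, which matches the "arbitrarily small error" wording: the approximation is never exact for finite `omega`, but the gap can be driven as low as desired.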

Abstract

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
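
To make the first claim concrete, here is a minimal NumPy sketch of why SiLU fits naturally into an attention head. It is an illustration under stated assumptions rather than the paper's exact construction: it assumes a bias-free neuron, an all-zero "anchor" token prepended to the sequence, and an extra flag dimension that is 1 on ordinary tokens and 0 on the anchor. The function names `mlp_neuron` and `attention_neuron` are hypothetical. With a mask that lets each token attend only to the anchor and to itself, the softmax over the two scores (0 and z) puts weight sigmoid(z) on the token itself, so the head outputs z · sigmoid(z) = SiLU(z).

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def mlp_neuron(X, w_in, w_out):
    # Reference: one hidden neuron with SiLU activation and no bias.
    return np.outer(silu(X @ w_in), w_out)

def attention_neuron(X, w_in, w_out):
    n, d = X.shape
    # Assumed setup: prepend an all-zero anchor token and append a flag
    # column that is 1 for ordinary tokens and 0 for the anchor.
    tokens = np.vstack([np.zeros((1, d + 1)),
                        np.hstack([X, np.ones((n, 1))])])
    # Internal dimension 1: query, key, and value are each a single scalar.
    w_q = np.append(w_in, 0.0)          # query = pre-activation z
    w_k = np.append(np.zeros(d), 1.0)   # key   = flag (0 on the anchor)
    w_v = np.append(w_in, 0.0)          # value = pre-activation z
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = np.outer(q, k)             # no scaling needed: sqrt(1) = 1
    # Mask: each token attends only to the anchor (column 0) and itself.
    mask = np.eye(n + 1, dtype=bool)
    mask[:, 0] = True
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    head = weights @ v                  # equals z * sigmoid(z) per token
    # Drop the anchor row and write the result back with w_out.
    return np.outer(head[1:], w_out)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
w_in = rng.standard_normal(4)
w_out = rng.standard_normal(4)
max_diff = np.abs(mlp_neuron(X, w_in, w_out) - attention_neuron(X, w_in, w_out)).max()
```

In this sketch one head reproduces one neuron, which is consistent with the abstract's remark that the conversion greatly increases the number of attention heads: an MLP layer with many hidden neurons would need one such head per neuron.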

Taxonomy

Topics: Neural Networks and Applications · Advanced Neural Network Applications · Machine Learning and ELM

Methods: Sigmoid Linear Unit