PairConnect: A Compute-Efficient MLP Alternative to Attention

Zhaozhuo Xu; Minghao Yan; Junyan Zhang; Anshumali Shrivastava

arXiv:2106.08235·cs.LG·June 16, 2021·1 cites

TL;DR

PairConnect is an MLP-based model that captures pairwise word interactions through embedding lookups, offering performance comparable to Transformers at significantly lower inference compute cost.

Contribution

The paper introduces PairConnect, an MLP alternative to Transformer attention that explicitly models pairwise word interactions, with lower compute cost and greater expressiveness.

Findings

01. Achieves language modeling performance comparable to Transformer.

02. Significantly reduces inference compute cost.

03. Mathematically more expressive than Transformer despite being an MLP.

Abstract

Transformer models have demonstrated superior performance in natural language processing. The dot product self-attention in Transformer allows us to model interactions between words. However, this modeling comes with significant computational overhead. In this work, we revisit the memory-compute trade-off associated with Transformer, particularly multi-head attention, and show a memory-heavy but significantly more compute-efficient alternative to Transformer. Our proposal, denoted as PairConnect, a multilayer perceptron (MLP), models the pairwise interaction between words by explicit pairwise word embeddings. As a result, PairConnect substitutes self dot product with a simple embedding lookup. We show mathematically that despite being an MLP, our compute-efficient PairConnect is strictly more expressive than Transformer. Our experiment on language modeling tasks suggests that…
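
The core idea in the abstract, replacing a computed dot product between two words with a lookup of an embedding stored for that word pair, can be illustrated with a toy sketch. This is not the paper's architecture: the table size, the modular hash in `pair_index`, the mean pooling, and all names below are assumptions chosen only to make the memory-for-compute trade visible.

```python
import numpy as np

VOCAB = 50         # toy vocabulary size (assumption)
DIM = 8            # embedding width (assumption)
TABLE_ROWS = 1024  # rows in the hypothetical pair-embedding table

rng = np.random.default_rng(0)
# One stored vector per (hashed) word pair: memory-heavy, but no dot products.
pair_table = rng.normal(size=(TABLE_ROWS, DIM))


def pair_index(i, j):
    """Map an ordered pair of word ids to a table row (hypothetical modular hash)."""
    return (i * VOCAB + j) % TABLE_ROWS


def pair_lookup_layer(token_ids):
    """For each position, average the stored embeddings of its pairs
    with every token in the sequence (mean pooling is an assumption)."""
    ids = np.asarray(token_ids)
    rows = pair_index(ids[:, None], ids[None, :])  # shape (n, n)
    return pair_table[rows].mean(axis=1)           # shape (n, DIM)


out = pair_lookup_layer([3, 7, 7, 1])
print(out.shape)
```

In this sketch the per-pair work is a single table read rather than a `DIM`-length dot product, while the table itself grows with the number of distinct pairs that must be stored, which is the memory-compute trade-off the abstract describes.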

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Multi-Head Attention · Label Smoothing · Residual Connection