PairConnect: A Compute-Efficient MLP Alternative to Attention
Zhaozhuo Xu, Minghao Yan, Junyan Zhang, Anshumali Shrivastava

TL;DR
PairConnect is an MLP-based model that captures pairwise word interactions through embedding lookups, reaching performance comparable to Transformers at significantly lower inference compute cost.
Contribution
Introduces PairConnect, an MLP alternative to Transformer attention that explicitly models pairwise word interactions with pairwise word embeddings, trading more memory for less compute while being more expressive.
Findings
Achieves language modeling performance comparable to that of Transformer.
Significantly reduces the computational cost of inference.
Shown mathematically to be strictly more expressive than Transformer, despite being an MLP.
Abstract
Transformer models have demonstrated superior performance in natural language processing. The dot-product self-attention in Transformer allows us to model interactions between words. However, this modeling comes with significant computational overhead. In this work, we revisit the memory-compute trade-off associated with Transformer, particularly multi-head attention, and show a memory-heavy but significantly more compute-efficient alternative to Transformer. Our proposal, PairConnect, is a multilayer perceptron (MLP) that models the pairwise interaction between words with explicit pairwise word embeddings. As a result, PairConnect replaces the self-attention dot product with a simple embedding lookup. We show mathematically that despite being an MLP, our compute-efficient PairConnect is strictly more expressive than Transformer. Our experiment on language modeling tasks suggests that…
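The core idea in the abstract, replacing the attention dot product with a lookup of a pair embedding, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the table size, the hashing of a token pair into a table row, the causal averaging over earlier tokens, and all names (`pair_table`, `pair_bucket`, `pair_lookup_layer`) are assumptions made for illustration only.

```python
import random

VOCAB_SIZE = 50      # hypothetical toy vocabulary size
NUM_BUCKETS = 1024   # hypothetical number of rows in the pair-embedding table
DIM = 8              # hypothetical embedding width

rng = random.Random(0)
# One vector per table row; random values stand in for trained weights.
pair_table = [[rng.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]


def pair_bucket(tok_i, tok_j):
    """Map an ordered token pair to a table row (assumed hashing scheme)."""
    return (tok_i * VOCAB_SIZE + tok_j) % NUM_BUCKETS


def pair_lookup_layer(tokens):
    """For each position i, average the embeddings of the pairs
    (tokens[i], tokens[j]) for all j <= i. Only lookups and additions
    are used; no query-key dot product is computed."""
    outputs = []
    for i, tok_i in enumerate(tokens):
        acc = [0.0] * DIM
        for tok_j in tokens[: i + 1]:
            row = pair_table[pair_bucket(tok_i, tok_j)]
            acc = [a + r for a, r in zip(acc, row)]
        outputs.append([a / (i + 1) for a in acc])
    return outputs


out = pair_lookup_layer([3, 17, 3, 42])
```

The sketch makes the memory-compute trade-off visible: the table grows with the number of pair buckets, while each interaction costs one row read and one vector addition instead of a dot product between projected vectors.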
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Multi-Head Attention · Label Smoothing · Residual Connection
