Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems
Dan Navon, Alex M. Bronstein

TL;DR
This paper provides a theoretical analysis showing an exponential expressive power gap between attention mechanisms and MLP-based architectures, explaining their differing performances in NLP and vision tasks.
Contribution
It offers a novel theoretical explanation for the limited performance of MLP-based models in NLP and vision tasks compared to attention mechanisms.
Findings
Exponential expressive power gap between attention and MLP mechanisms.
MLP architectures are weaker in modeling dependencies across multiple inputs.
The results suggest that the performance gap may stem from the inherent modeling limitations of MLPs.
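To make the contrast in the findings above concrete, here is a minimal NumPy sketch of the two mixing mechanisms. It is an illustration under stated assumptions, not the paper's construction or proof: the function names, the shapes, the random weights, and the use of ReLU in place of the usual GELU are all hypothetical choices made for brevity. The point it shows is that a Mixer-style token-mixing MLP combines tokens with weights that are fixed once trained, while self-attention computes its mixing weights from the input itself.

```python
import numpy as np

def token_mixing_mlp(x, w1, w2):
    """Mixer-style token mixing on x of shape (tokens, channels).

    The MLP acts along the token axis; w1 and w2 do not depend on x.
    ReLU is used instead of GELU purely for simplicity (assumption).
    """
    hidden = np.maximum(w1 @ x, 0.0)   # (hidden, channels)
    return w2 @ hidden                 # (tokens, channels)

def self_attention(x, wq, wk, wv):
    """Single-head dot-product self-attention on x of shape (tokens, channels).

    Returns the output and the (tokens, tokens) mixing weights,
    which are computed from x and therefore change with the input.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
tokens, channels, hidden_dim = 4, 8, 16
x_a = rng.normal(size=(tokens, channels))
x_b = rng.normal(size=(tokens, channels))
w1 = rng.normal(size=(hidden_dim, tokens))
w2 = rng.normal(size=(tokens, hidden_dim))
wq, wk, wv = (rng.normal(size=(channels, channels)) for _ in range(3))

mixed = token_mixing_mlp(x_a, w1, w2)
_, attn_a = self_attention(x_a, wq, wk, wv)
_, attn_b = self_attention(x_b, wq, wk, wv)

# Attention mixes tokens differently for different inputs,
# whereas w1 and w2 above are the same for x_a and x_b.
print(np.allclose(attn_a, attn_b))
```

The sketch only contrasts input-independent and input-dependent token mixing; it says nothing quantitative about the exponential gap, which is established by the paper's theoretical analysis.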
Abstract
Vision Transformers are widely used in various vision tasks. Meanwhile, another line of work, starting with the MLP-Mixer, tries to achieve similar performance using MLP-based architectures. Interestingly, until now those MLP-based architectures have not been adapted for NLP tasks. Additionally, until now, MLP-based architectures have failed to achieve state-of-the-art performance in vision tasks. In this paper, we analyze the expressive power of MLP-based architectures in modeling dependencies between multiple different inputs simultaneously, and show an exponential gap between the attention and the MLP-based mechanisms. Our results suggest a theoretical explanation for the inability of MLPs to compete with attention-based mechanisms in NLP problems. They also suggest that the performance gap in vision tasks may be due to the relative weakness of MLPs in modeling dependencies between…
Taxonomy
Topics: Machine Learning and Data Classification · Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications
Methods: Average Pooling · Network On Network · Dense Connections · Global Average Pooling · Dropout · Residual Connection · Layer Normalization · MLP-Mixer
