RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?
Yuki Tatsunami, Masato Taki

TL;DR
RaftMLP is an attention-free, MLP-based vision architecture that builds inductive bias into MLP-Mixer's token mixing, for example by splitting the token-mixing block into vertical and horizontal mixing, which reduces computational complexity and improves accuracy.
Contribution
The paper proposes RaftMLP, an MLP architecture that improves accuracy and efficiency by adding inductive bias to token mixing while retaining MLP-Mixer's ability to capture global correlations, challenging the dominance of attention-based models.
Findings
RaftMLP achieves accuracy comparable to state-of-the-art models.
The model reduces parameters and computational complexity.
It can serve as a backbone for downstream vision tasks.
Abstract
For the past ten years, CNNs have reigned supreme in the world of computer vision, but recently, Transformers have been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practical applications. There has been much research on architectures without CNNs and self-attention in this context. In particular, MLP-Mixer is a simple architecture designed using MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of incorporating a non-convolutional (or non-local) inductive bias into the architecture, so we used two simple ideas to incorporate inductive bias into the MLP-Mixer while taking advantage of its ability to capture global correlations. One is to divide the token-mixing block vertically and horizontally. Another…
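The first idea in the abstract, dividing token mixing into a vertical and a horizontal step, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: each mixing step is simplified to a single linear map (no normalization, nonlinearity, or channel grouping), the residual connections follow the usual Mixer pattern, and the names `mix_along_axis`, `axis_split_token_mixing`, and `token_mixing_weight_counts` are hypothetical. The weight count compares one dense linear map over all H×W tokens with two per-axis maps.

```python
import numpy as np

def mix_along_axis(x, weight, axis):
    """Apply a linear map along one axis of x, leaving the other axes untouched."""
    moved = np.moveaxis(x, axis, -1)      # bring the axis to be mixed to the end
    mixed = moved @ weight.T              # (..., n) @ (n, n) -> (..., n)
    return np.moveaxis(mixed, -1, axis)   # restore the original axis order

def axis_split_token_mixing(x, w_vertical, w_horizontal):
    """Mix tokens along height, then along width. x has shape (H, W, C).

    Assumption: each step is one linear layer with a residual connection;
    a real block would typically use a small MLP per axis.
    """
    x = x + mix_along_axis(x, w_vertical, axis=0)    # vertical mixing
    x = x + mix_along_axis(x, w_horizontal, axis=1)  # horizontal mixing
    return x

def token_mixing_weight_counts(height, width):
    """Weights for one dense map over all tokens vs. two per-axis maps."""
    full = (height * width) ** 2
    split = height ** 2 + width ** 2
    return full, split

rng = np.random.default_rng(0)
H, W, C = 14, 14, 8                       # illustrative sizes only
x = rng.standard_normal((H, W, C))
w_v = rng.standard_normal((H, H)) * 0.01
w_h = rng.standard_normal((W, W)) * 0.01

y = axis_split_token_mixing(x, w_v, w_h)
full, split = token_mixing_weight_counts(H, W)
print(y.shape, full, split)               # (14, 14, 8) 38416 392
```

Even in this simplified form, every output position can still depend on every input position after the two steps, since vertical mixing followed by horizontal mixing connects any pair of tokens, while the number of mixing weights grows with H² + W² rather than (HW)².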
Taxonomy
Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Vision Transformer · Label Smoothing
