RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?

Yuki Tatsunami; Masato Taki

arXiv:2108.04384 · cs.CV · January 13, 2023 · 1 citation

TL;DR

RaftMLP introduces a novel MLP-based architecture that reduces computational complexity and enhances accuracy without relying on attention mechanisms, by incorporating inductive biases and spatial correlations.

Contribution

The paper proposes a new MLP architecture, RaftMLP, that improves accuracy and efficiency by integrating inductive biases and spatial correlations, challenging the dominance of attention-based models.

Findings

01. RaftMLP achieves comparable accuracy to state-of-the-art models.
02. The model reduces parameters and computational complexity.
03. It can serve as a backbone for downstream vision tasks.

Abstract

For the past ten years, CNNs have reigned supreme in computer vision, but recently Transformers have been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practical applications. In this context, there has been much research on architectures that use neither CNNs nor self-attention. In particular, MLP-Mixer is a simple architecture built from MLPs that achieves accuracy comparable to that of the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of incorporating a non-convolutional (or non-local) inductive bias into the architecture, so we use two simple ideas to incorporate inductive bias into the MLP-Mixer while taking advantage of its ability to capture global correlations. One is to divide the token-mixing block vertically and horizontally. Another…
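The first idea in the abstract can be illustrated with a minimal sketch. It assumes that "dividing the token-mixing block vertically and horizontally" means mixing tokens along the height axis and the width axis separately; the abstract does not give the details, so this is not the authors' implementation. For simplicity, a single linear map stands in for each token-mixing MLP and the channel dimension is omitted. The function names, the weights, and the 14x14 grid are hypothetical.

```python
# Illustrative sketch only. Assumptions: one linear map per token-mixing step,
# no channel dimension, no biases. All names here are hypothetical.

def mix_vertical(x, w_v):
    """Mix tokens along the height axis: out[i][j] = sum_k w_v[i][k] * x[k][j]."""
    h, w = len(x), len(x[0])
    return [[sum(w_v[i][k] * x[k][j] for k in range(h)) for j in range(w)]
            for i in range(h)]

def mix_horizontal(x, w_h):
    """Mix tokens along the width axis: out[i][j] = sum_k x[i][k] * w_h[j][k]."""
    h, w = len(x), len(x[0])
    return [[sum(x[i][k] * w_h[j][k] for k in range(w)) for j in range(w)]
            for i in range(h)]

def token_mixing_weights(h, w):
    """Weight counts for linear token mixing on an h-by-w grid of tokens."""
    full = (h * w) ** 2      # one map over all h*w tokens at once
    split = h * h + w * w    # one map for the height axis, one for the width axis
    return full, split

# A 2x3 grid of scalar tokens.
grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
swap_rows = [[0.0, 1.0],
             [1.0, 0.0]]     # hypothetical vertical weights: swap the two rows
identity3 = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]  # hypothetical horizontal weights: no change

mixed = mix_horizontal(mix_vertical(grid, swap_rows), identity3)
full, split = token_mixing_weights(14, 14)
```

Under these assumptions, a 14x14 token grid needs 38,416 weights for a single map over all tokens but only 392 for the two per-axis maps, which gives a rough sense of why the split can reduce parameters.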

Taxonomy

Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Vision Transformer · Label Smoothing