Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

Qibin Hou; Zihang Jiang; Li Yuan; Ming-Ming Cheng; Shuicheng Yan; Jiashi Feng

arXiv:2106.12368·cs.CV·June 24, 2021·25 cites

TL;DR

Vision Permutator is a simple, data-efficient MLP-like architecture that encodes positional information along the height and width dimensions separately, achieving competitive ImageNet accuracy without relying on spatial convolutions or attention mechanisms.

Contribution

It proposes encoding spatial information in MLP-like models with separate linear projections along the height and width dimensions, rather than along the flattened spatial dimensions, improving efficiency and accuracy.

Findings

01. Achieves 81.5% top-1 accuracy on ImageNet with 25M parameters.

02. Outperforms many CNNs and vision transformers of similar size.

03. Scales to 88M parameters with 83.2% accuracy.

Abstract

In this paper, we present Vision Permutator, a conceptually simple and data efficient MLP-like architecture for visual recognition. By realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs)…
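
The separate height and width encoding described in the abstract can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `encode_height_width`, the weight names `w_h` and `w_w`, and the choice to aggregate the two branches by a plain sum are all hypothetical, since the abstract does not specify these details.

```python
import numpy as np

def encode_height_width(x, w_h, w_w):
    """Simplified sketch (hypothetical, not the paper's exact module).

    x   : feature map of shape (H, W, C)
    w_h : (H, H) linear projection applied along the height axis
    w_w : (W, W) linear projection applied along the width axis
    """
    # Height branch: mixes all rows within each column, so it captures
    # long-range dependencies along H while keeping the W position exact.
    x_h = np.einsum("gh,hwc->gwc", w_h, x)
    # Width branch: mixes all columns within each row, keeping the H position.
    x_w = np.einsum("vw,hwc->hvc", w_w, x)
    # Assumption: aggregate the two position-sensitive outputs by summation.
    return x_h + x_w

rng = np.random.default_rng(0)
H, W, C = 4, 5, 6
x = rng.standard_normal((H, W, C))
w_h = rng.standard_normal((H, H))
w_w = rng.standard_normal((W, W))
out = encode_height_width(x, w_h, w_w)
print(out.shape)
```

In this sketch the output keeps the input shape `(H, W, C)`. With the width weights set to zero, a change confined to one column of `x` affects only that column of the output, which is the sense in which the height branch preserves positional information along the width.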

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning