Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition
Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, Jiashi Feng

TL;DR
Vision Permutator is a simple, data-efficient MLP-like architecture that encodes positional information along the height and width dimensions separately, achieving competitive accuracy on ImageNet without spatial convolutions or attention mechanisms.
Contribution
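It proposes a way to encode spatial information in MLP-like models: instead of mixing tokens along the flattened spatial dimensions, it applies linear projections along the height and width dimensions separately, so that each output captures long-range dependencies along one direction while preserving precise positional information along the other.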
Findings
Achieves 81.5% top-1 accuracy on ImageNet with 25M parameters.
Outperforms many CNNs and vision transformers of similar size.
Scales to 88M parameters with 83.2% accuracy.
Abstract
In this paper, we present Vision Permutator, a conceptually simple and data efficient MLP-like architecture for visual recognition. By realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs)…
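The separate height and width encoding described in the abstract can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: it mixes each axis directly with a square weight matrix and sums the branches, whereas the abstract does not spell out the exact permutation or aggregation procedure. The function names (`height_branch`, `width_branch`, `channel_branch`, `axis_mixing_sketch`) and the channel branch are hypothetical choices made for this example.

```python
import numpy as np

def height_branch(x, w_h):
    # x: (H, W, C), w_h: (H, H). Mixes information along height only;
    # each width position is processed independently of the others.
    return np.einsum("gh,hwc->gwc", w_h, x)

def width_branch(x, w_w):
    # w_w: (W, W). Mixes information along width only;
    # each height position is processed independently of the others.
    return np.einsum("vw,hwc->hvc", w_w, x)

def channel_branch(x, w_c):
    # w_c: (C, C). Plain per-token channel projection (assumed here).
    return np.einsum("hwc,cd->hwd", x, w_c)

def axis_mixing_sketch(x, w_h, w_w, w_c):
    # Assumption: branches are combined by a plain sum.
    return height_branch(x, w_h) + width_branch(x, w_w) + channel_branch(x, w_c)

rng = np.random.default_rng(0)
H, W, C = 4, 5, 8
x = rng.standard_normal((H, W, C))
w_h = rng.standard_normal((H, H))
w_w = rng.standard_normal((W, W))
w_c = rng.standard_normal((C, C))

out = axis_mixing_sketch(x, w_h, w_w, w_c)
print(out.shape)
```

In this sketch, changing the input at a single width position changes the height branch's output only at that same width position, which is one way to read "preserve precise positional information along the other direction".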
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
