Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning
Kaiyou Song, Shan Zhang, Tong Wang

TL;DR
This paper introduces SemAIM, a semantic-aware autoregressive image modeling approach that improves self-supervised visual representation learning by modeling images from semantic to less semantic patches, achieving state-of-the-art results.
Contribution
SemAIM proposes a novel permutation-based autoregressive modeling method guided by semantic similarity, enhancing high-level feature learning in images.
Findings
Achieves 84.1% top-1 accuracy on ImageNet with ViT-B.
Outperforms vanilla MAE in object detection and segmentation tasks.
Demonstrates state-of-the-art performance across multiple downstream tasks.
Abstract
The development of autoregressive modeling (AM) in computer vision lags behind natural language processing (NLP) in self-supervised pre-training. This is mainly because images are not sequential signals and lack a natural order when autoregressive modeling is applied. In this study, inspired by the way humans grasp an image, i.e., focusing on the main object first, we present a semantic-aware autoregressive image modeling (SemAIM) method to tackle this challenge. The key insight of SemAIM is to autoregressively model images from the semantic patches to the less semantic patches. To this end, we first calculate a semantic-aware permutation of patches according to their feature similarities and then perform the autoregression procedure based on the permutation. In addition, considering that the raw pixels of patches are low-level signals and are not ideal…
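The two steps named in the abstract (ordering patches by feature similarity, then running autoregression in that order) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity against a single `query_feat` vector, and the strict "earlier patches only" mask are all assumptions made for illustration, and the excerpt does not say how the reference features are obtained.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_permutation(patch_feats, query_feat):
    """Return patch indices ordered from most to least similar to query_feat.

    query_feat is a hypothetical stand-in for whatever 'semantic' reference
    is used; the excerpt does not specify it.
    """
    sims = [cosine(f, query_feat) for f in patch_feats]
    # sorted() is stable, so ties keep their original patch order.
    return sorted(range(len(patch_feats)), key=lambda i: -sims[i])

def autoregressive_mask(perm):
    """mask[i][j] is True when patch i may attend to patch j,
    i.e. when j comes strictly earlier than i in the permutation."""
    rank = [0] * len(perm)
    for position, patch_index in enumerate(perm):
        rank[patch_index] = position
    n = len(perm)
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]

# Toy example: four 2-D patch features and an assumed query direction.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_feat = [1.0, 0.0]

perm = semantic_permutation(patch_feats, query_feat)
mask = autoregressive_mask(perm)
```

In this toy case patch 0 is the most similar to the query and so is predicted first with no context, while patch 3 is predicted last and may attend to all three others. A real model would apply such a mask inside its attention layers.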
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Masked autoencoder
