Rejuvenating image-GPT as Strong Visual Representation Learners
Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie

TL;DR
This paper improves image-GPT by switching the prediction target from raw pixels to semantic tokens and by additionally predicting visible tokens, yielding a visual representation model that reaches 90.0% top-1 accuracy on ImageNet-1K.
Contribution
It introduces D-iGPT, a novel approach that enhances image-GPT with semantic tokens and visible token prediction, significantly boosting visual representation learning.
Findings
Achieves 90.0% top-1 accuracy on ImageNet-1K with ViT-H.
Outperforms previous models in downstream tasks.
Demonstrates strong generalization capabilities.
Abstract
This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict the next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: A notable achievement is its compelling performance on the ImageNet-1K dataset -- by training on publicly available datasets, D-iGPT unprecedentedly…
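The two changes described in the abstract can be sketched as a combined training objective. This is a minimal illustration, not the authors' implementation: the function names (`cosine_loss`, `d_igpt_style_loss`), the use of cosine distance, the `visible_weight` parameter, and the per-token layout are all assumptions made for the example. The semantic tokens are assumed to come from a frozen, discriminatively trained encoder such as CLIP and are passed in as plain vectors.

```python
import math


def cosine_loss(pred, target):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("zero-norm vector has no defined cosine similarity")
    return 1.0 - dot / norm


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def d_igpt_style_loss(next_preds, visible_preds, semantic_tokens, visible_weight=1.0):
    """Hypothetical two-term objective over one sequence of n token positions.

    semantic_tokens: n target vectors (assumed output of a frozen encoder).
    next_preds:      n - 1 vectors; next_preds[i] is the prediction for
                     token i + 1 made from tokens 0..i (autoregressive term).
    visible_preds:   n vectors; visible_preds[i] is the prediction for
                     token i itself (visible-token term).
    visible_weight:  assumed scalar balancing the two terms.
    """
    if len(next_preds) != len(semantic_tokens) - 1:
        raise ValueError("expected one next-token prediction per target after the first")
    if len(visible_preds) != len(semantic_tokens):
        raise ValueError("expected one visible-token prediction per target")
    # Next-token term: compare each prediction with the following semantic token.
    next_loss = _mean(
        cosine_loss(p, t) for p, t in zip(next_preds, semantic_tokens[1:])
    )
    # Visible-token term: compare each prediction with the token at the same position.
    visible_loss = _mean(
        cosine_loss(p, t) for p, t in zip(visible_preds, semantic_tokens)
    )
    return next_loss + visible_weight * visible_loss
```

Under these assumptions, predictions that match their targets exactly drive both terms toward zero, while predictions orthogonal to their targets contribute a loss of 1 per term. In a real pipeline the predictions would come from a ViT decoder and the loss would be computed on batched tensors rather than Python lists.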
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
