Rejuvenating image-GPT as Strong Visual Representation Learners

Sucheng Ren; Zeyu Wang; Hongru Zhu; Junfei Xiao; Alan Yuille; Cihang Xie

arXiv:2312.02147·cs.CV·July 8, 2024·1 cites

TL;DR

This paper improves image-GPT by shifting the prediction target from raw pixels to semantic tokens and by predicting visible tokens alongside the next tokens. The resulting model, D-iGPT, reaches 90.0% top-1 accuracy on ImageNet-1K with ViT-H.

Contribution

It introduces D-iGPT, a novel approach that enhances image-GPT with semantic tokens and visible token prediction, significantly boosting visual representation learning.

Findings

1. Achieves 90.0% top-1 accuracy on ImageNet-1K with ViT-H.

2. Outperforms previous models in downstream tasks.

3. Demonstrates strong generalization capabilities.

Abstract

This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict the next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: A notable achievement is its compelling performance on the ImageNet-1K dataset -- by training on publicly available datasets, D-iGPT unprecedentedly…
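The two changes described in the abstract can be illustrated with a small training-objective sketch. This is not the authors' implementation: the function names (`cosine_distance`, `d_igpt_style_loss`), the use of cosine distance as the per-token loss, and the `visible_weight` parameter are all assumptions made for illustration. The sketch only shows the general shape of the idea, in which each position is scored on predicting the next semantic token and on predicting its own (visible) semantic token, with targets assumed to come from a frozen, discriminatively trained encoder such as CLIP.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def d_igpt_style_loss(next_preds, visible_preds, targets, visible_weight=1.0):
    """Hypothetical combined objective (illustrative, not the paper's code).

    targets:       T semantic token vectors in autoregressive order,
                   assumed to be produced by a frozen encoder (e.g. CLIP).
    next_preds:    T-1 vectors; next_preds[i] is the prediction made at
                   position i for the token at position i + 1.
    visible_preds: T vectors; visible_preds[i] is the prediction made at
                   position i for the token at position i itself.
    """
    if len(next_preds) != len(targets) - 1:
        raise ValueError("need one next-token prediction per position except the last")
    if len(visible_preds) != len(targets):
        raise ValueError("need one visible-token prediction per position")

    # Next-token term: position i is compared against target i + 1.
    next_terms = [cosine_distance(p, t) for p, t in zip(next_preds, targets[1:])]
    # Visible-token term: position i is compared against target i.
    visible_terms = [cosine_distance(p, t) for p, t in zip(visible_preds, targets)]

    next_loss = sum(next_terms) / len(next_terms)
    visible_loss = sum(visible_terms) / len(visible_terms)
    return next_loss + visible_weight * visible_loss


# Toy usage with three 2-dimensional "semantic tokens".
toy_targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perfect = d_igpt_style_loss(toy_targets[1:], toy_targets, toy_targets)
```

With predictions that match the targets exactly, both terms are close to zero; predictions orthogonal to their targets contribute a distance of 1 each. In a real model the predictions would come from a causal Transformer's output heads rather than from hand-written lists.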

Code & Models

Videos

Rejuvenating image-GPT as Strong Visual Representation Learners (SlidesLive)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training