PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Xiaoyi Dong; Jianmin Bao; Ting Zhang; Dongdong Chen; Weiming Zhang; Lu Yuan; Dong Chen; Fang Wen; Nenghai Yu; Baining Guo

arXiv:2111.12710 · cs.CV · December 19, 2022 · 32 cites

TL;DR

This paper introduces PeCo, a perceptual codebook for BERT pre-training of vision transformers, which aligns prediction targets with human perception, leading to improved semantic understanding and transfer performance.

Contribution

It proposes a perceptual prediction target learned through enforcing perceptual similarity during dVAE training, enhancing the semantic quality of visual tokens for vision transformer pre-training.

Findings

01. Achieves 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, surpassing BEiT by 1.3 percentage points.

02. Improves object detection and segmentation results on COCO and ADE20K.

03. Sets a state-of-the-art 88.3% accuracy with ViT-H using only ImageNet-1K data.

Abstract

This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve $84.5\%$ Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming…
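
The idea of enforcing perceptual similarity during dVAE training can be illustrated with a minimal sketch. This is not the authors' implementation: the feature extractor `toy_features` is a hypothetical stand-in for the self-supervised transformer mentioned in the abstract, and the per-layer normalization, the pixel-level reconstruction term, and the `perceptual_weight` parameter are all assumptions made for illustration. The sketch only shows the general shape of such an objective: a reconstruction loss plus a distance between deep features of the input and its reconstruction.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# A feature extractor maps an input to one feature vector per "layer".
FeatureExtractor = Callable[[Sequence[float]], List[Vector]]


def l2_normalize(v: Sequence[float], eps: float = 1e-8) -> Vector:
    """Scale a vector to (approximately) unit length; eps avoids division by zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def pixel_loss(x: Sequence[float], x_rec: Sequence[float]) -> float:
    """Mean squared error between the input and its reconstruction."""
    return sum((p - q) ** 2 for p, q in zip(x, x_rec)) / len(x)


def perceptual_loss(
    x: Sequence[float], x_rec: Sequence[float], extract_features: FeatureExtractor
) -> float:
    """Sum over layers of the mean squared distance between normalized features."""
    loss = 0.0
    for f_x, f_rec in zip(extract_features(x), extract_features(x_rec)):
        a, b = l2_normalize(f_x), l2_normalize(f_rec)
        loss += sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
    return loss


def tokenizer_loss(
    x: Sequence[float],
    x_rec: Sequence[float],
    extract_features: FeatureExtractor,
    perceptual_weight: float = 1.0,  # assumed weighting, not taken from the paper
) -> float:
    """Reconstruction objective with an added perceptual similarity term."""
    return pixel_loss(x, x_rec) + perceptual_weight * perceptual_loss(
        x, x_rec, extract_features
    )


def toy_features(x: Sequence[float]) -> List[Vector]:
    """Hypothetical stand-in for a deep network: two hand-made feature 'layers'."""
    layer1 = [x[i + 1] - x[i] for i in range(len(x) - 1)]  # local differences
    layer2 = [sum(x) / len(x), max(x) - min(x)]  # global statistics
    return [layer1, layer2]


if __name__ == "__main__":
    image = [0.0, 0.5, 1.0]
    reconstruction = [0.0, 0.4, 1.0]
    print(tokenizer_loss(image, reconstruction, toy_features))
```

In a real setting `extract_features` would return intermediate activations of a pretrained network, and the loss would be minimized over the tokenizer's parameters. Here it only evaluates the objective for a fixed pair of inputs.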

Code & Models

Repositories

xyzforever/bevt (PyTorch)

Videos

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Weight Decay · Dense Connections · Softmax · Linear Warmup With Linear Decay · Residual Connection