Rethinking Hierarchies in Pre-trained Plain Vision Transformer

Yufei Xu; Jing Zhang; Qiming Zhang; Dacheng Tao

arXiv:2211.01785 · cs.CV · November 9, 2022

TL;DR

This paper proposes a simple method to convert plain Vision Transformers into hierarchical models with minimal changes, reducing pre-training costs and improving performance across multiple vision tasks.

Contribution

It introduces a minimal modification approach to create hierarchical ViTs from plain ViTs, avoiding extensive pre-training and leveraging existing checkpoints.

Findings

01 Outperforms plain ViT in classification, detection, and segmentation tasks.

02 Reduces computational cost by avoiding pre-training hierarchical ViTs.

03 Achieves better results with minimal architectural changes.

Abstract

Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective. However, customized algorithms, e.g., GreenMIM, must be carefully designed for hierarchical ViTs, instead of using the simple vanilla MAE that works for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of plain ViTs, they must be pre-trained from scratch at massive computational cost, so the approach incurs both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of linear embedding layer from 16 to 4 and add convolution (or simple average)…
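
The stride change described in the abstract can be illustrated with token-grid arithmetic. The following is a minimal sketch, not the authors' implementation: the abstract is truncated here, so the 224-pixel input, the helper names `grid_size` and `avg_pool_2x2`, and the choice of two 2x2 average-pooling steps are all assumptions, picked only so that the final grid matches the 14x14 grid of a stride-16 plain ViT.

```python
# Hypothetical illustration: how the token grid changes when the embedding
# stride goes from 16 to 4, and how simple averaging can shrink it again.


def grid_size(image_size: int, stride: int) -> int:
    """Tokens per side, assuming the embedding tiles the image evenly."""
    return image_size // stride


def avg_pool_2x2(grid):
    """Average each non-overlapping 2x2 block of a square grid of floats."""
    n = len(grid) // 2
    return [
        [
            (
                grid[2 * i][2 * j]
                + grid[2 * i][2 * j + 1]
                + grid[2 * i + 1][2 * j]
                + grid[2 * i + 1][2 * j + 1]
            )
            / 4.0
            for j in range(n)
        ]
        for i in range(n)
    ]


IMAGE_SIZE = 224  # assumed input resolution, not stated in the abstract

plain_side = grid_size(IMAGE_SIZE, 16)  # plain ViT: stride-16 embedding
fine_side = grid_size(IMAGE_SIZE, 4)    # modified: stride-4 embedding

# Toy single-channel feature map at the fine resolution.
fine = [
    [float(r * fine_side + c) for c in range(fine_side)]
    for r in range(fine_side)
]

# Assumed placement: two pooling steps, each halving the grid side.
stage2 = avg_pool_2x2(fine)
stage3 = avg_pool_2x2(stage2)

print(plain_side, fine_side, len(stage2), len(stage3))
```

Under these assumptions the stride-4 embedding yields 16 times as many tokens as the plain ViT, and the two pooling steps bring the grid back to the plain ViT's resolution, which is what would let later layers operate on the same number of tokens as the pre-trained model.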

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Advanced Memory and Neural Computing

Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Convolution · Vision Transformer