Rethinking Hierarchies in Pre-trained Plain Vision Transformer
Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

TL;DR
This paper proposes a simple method to convert plain Vision Transformers into hierarchical models with minimal changes, reducing pre-training costs and improving performance across multiple vision tasks.
Contribution
The paper introduces a minimal-modification approach that turns a plain ViT into a hierarchical one. This avoids pre-training hierarchical models from scratch and reuses existing plain-ViT checkpoints.
Findings
The converted hierarchical ViT outperforms the plain ViT on classification, detection, and segmentation tasks.
Reusing off-the-shelf plain-ViT checkpoints removes the computational cost of pre-training hierarchical ViTs separately.
These gains require only minimal architectural changes.
Abstract
Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective. However, customized algorithms should be carefully designed for the hierarchical ViTs, e.g., GreenMIM, instead of using the vanilla and simple MAE for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of the plain ViTs, the requirement of pre-training them leads to a massive amount of computational cost, thereby incurring both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of linear embedding layer from 16 to 4 and add convolution (or simple average)…
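To make the stride change concrete, here is a minimal sketch of how the token grid grows when the embedding stride drops from 16 to 4, and how a simple 2x2 average shrinks it again. The 224-pixel input size, the helper names `token_grid` and `avg_pool_2x2`, and the three-step downsampling schedule are all assumptions made for illustration. The abstract text above is cut off before it says where or how often downsampling is applied, so this should not be read as the authors' implementation.

```python
def token_grid(image_size: int, stride: int) -> int:
    """Side length of the token grid for a non-overlapping patch embedding."""
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    return image_size // stride


def avg_pool_2x2(grid):
    """Average each non-overlapping 2x2 block of a square 2D list of numbers."""
    n = len(grid)
    if n % 2 != 0:
        raise ValueError("grid side must be even")
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, n, 2)
        ]
        for r in range(0, n, 2)
    ]


IMAGE_SIZE = 224  # assumed input resolution; not stated in the abstract

plain_side = token_grid(IMAGE_SIZE, 16)  # plain ViT: stride-16 embedding
fine_side = token_grid(IMAGE_SIZE, 4)    # modified embedding: stride 4

# Hypothetical schedule: halve the grid side three times.
# The number and placement of these steps are assumptions.
sides = [fine_side]
for _ in range(3):
    sides.append(sides[-1] // 2)

# Tiny demonstration of the averaging step on a 4x4 grid of scalar "tokens".
toy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled = avg_pool_2x2(toy)

print(plain_side, fine_side, sides, pooled)
```

With stride 4 the first stage holds 16 times as many tokens as the stride-16 plain ViT, so any practical version would need some form of downsampling before later blocks. The sketch only tracks grid sizes and does not model attention or weight reuse.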
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Advanced Memory and Neural Computing
Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Convolution · Vision Transformer
