POA: Pre-training Once for Models of All Sizes

Yingying Zhang; Xin Guo; Jiangwei Lao; Lei Yu; Lixiang Ru; Jian Wang; Guo Ye; Huimei He; Jingdong Chen; Ming Yang

arXiv:2408.01031·cs.CV·August 5, 2024

TL;DR

POA is a self-distillation framework in which a single pre-training run yields models of many different sizes, so deployments with different compute or storage budgets do not each require a separate pre-training run.

Contribution

The paper proposes a tri-branch self-supervised training method with an elastic student, allowing one pre-training session to generate diverse model sizes for various downstream applications.

Findings

01. Achieves state-of-the-art performance on multiple benchmarks.

02. Produces around a hundred models of different sizes from a single pre-training.

03. Effective across various backbones like ViT, Swin Transformer, and ResNet.

Abstract

Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a fixed size at a time. However, computation and storage constraints in real-world scenarios call for a series of models of different sizes, and developing that series takes substantial effort. In this study, we therefore propose a novel tri-branch self-supervised training framework, termed POA (Pre-training Once for All), to tackle this issue. Our approach introduces an elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes…
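The sampling step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a two-layer MLP in place of a real backbone, assumes the elastic student is formed by keeping only the first `width` hidden units of the intact student (so the two share parameters), and assumes both students are trained to match the teacher's softmax output with a cross-entropy loss. All function names (`make_mlp`, `elastic_forward`, `tri_branch_loss`, `sample_width`) and the candidate widths are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))


def linear(weight, x):
    """Matrix-vector product; `weight` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def make_mlp(rng, in_dim, hidden, out_dim):
    """Toy two-layer network standing in for a backbone (assumption)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(out_dim)]
    return {"w_in": w_in, "w_out": w_out}


def elastic_forward(model, x, width):
    """Forward pass using only the first `width` hidden units.

    The slices read from the same weight lists as the full model, which is
    one simple way (assumed here) to make a sub-network share parameters.
    """
    hidden = [max(0.0, h) for h in linear(model["w_in"][:width], x)]
    return linear([row[:width] for row in model["w_out"]], hidden)


def sample_width(rng, candidate_widths):
    """Randomly pick the elastic student's width for this step."""
    return rng.choice(candidate_widths)


def tri_branch_loss(teacher, student, x, width):
    """Sum of distillation losses for the intact and elastic students."""
    full_width = len(student["w_in"])
    target = softmax(elastic_forward(teacher, x, len(teacher["w_in"])))
    intact_logits = elastic_forward(student, x, full_width)
    elastic_logits = elastic_forward(student, x, width)
    return cross_entropy(target, intact_logits) + cross_entropy(target, elastic_logits)


if __name__ == "__main__":
    rng = random.Random(0)
    teacher = make_mlp(rng, in_dim=4, hidden=8, out_dim=3)
    student = make_mlp(rng, in_dim=4, hidden=8, out_dim=3)
    x = [0.5, -1.0, 0.25, 2.0]
    for step in range(3):
        width = sample_width(rng, [2, 4, 8])  # hypothetical candidate widths
        print(step, width, tri_branch_loss(teacher, student, x, width))
```

In this sketch, extracting a smaller model after training amounts to calling `elastic_forward` with a fixed `width`. A real setup would also need gradient updates and a rule for updating the teacher, both omitted here.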

Code & Models

Repositories

qichuzyy/poa (Official)

Taxonomy

Topics: Simulation Techniques and Applications · Model-Driven Software Engineering Techniques

Methods: Average Pooling · Linear Layer · Residual Connection · Multi-Head Attention · Stochastic Depth · Attention Is All You Need · Position-Wise Feed-Forward Layer · Kaiming Initialization · Adam · Byte Pair Encoding