POA: Pre-training Once for Models of All Sizes
Yingying Zhang, Xin Guo, Jiangwei Lao, Lei Yu, Lixiang Ru, Jian Wang, Guo Ye, Huimei He, Jingdong Chen, Ming Yang

TL;DR
POA introduces a novel self-distillation framework that enables pre-training a single model to produce multiple models of different sizes simultaneously, improving efficiency and versatility for vision tasks.
Contribution
The paper proposes a tri-branch self-supervised training method with an elastic student, allowing one pre-training session to generate diverse model sizes for various downstream applications.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Produces around a hundred models of different sizes from a single pre-training.
Effective across various backbones like ViT, Swin Transformer, and ResNet.
Abstract
Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a fixed size at a time. However, computation and storage constraints in real-world scenarios mean that substantial effort is needed to develop a series of models of different sizes for deployment. In this study, we therefore propose a novel tri-branch self-supervised training framework, termed POA (Pre-training Once for All), to tackle this issue. Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes…
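The per-step sampling described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `SubNetConfig`, `candidate_configs`, and `sample_elastic_student`, the uniform sampling rule, and the width and depth ranges are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SubNetConfig:
    """Hypothetical description of one sub-network of the full student."""
    width: int  # embedding dimension kept in the sub-network
    depth: int  # number of blocks kept in the sub-network


def candidate_configs(min_width, max_width, width_step, min_depth, max_depth):
    """Enumerate every (width, depth) pair that could be extracted.

    The grid layout is an assumption for illustration; the real search
    space depends on the backbone.
    """
    widths = range(min_width, max_width + 1, width_step)
    depths = range(min_depth, max_depth + 1)
    return [SubNetConfig(w, d) for w in widths for d in depths]


def sample_elastic_student(candidates, rng):
    """Pick one sub-network uniformly at random for this training step."""
    return rng.choice(candidates)


# Assumed example ranges (not taken from the paper).
candidates = candidate_configs(
    min_width=384, max_width=1024, width_step=64,
    min_depth=12, max_depth=24,
)
# 11 widths x 13 depths = 143 candidate sub-networks
print(len(candidates))

rng = random.Random(0)
for step in range(3):
    elastic = sample_elastic_student(candidates, rng)
    # In a real training loop, the elastic student would share weights with
    # the full student and be trained alongside the other branches here.
    print(step, elastic)
```

Under these assumed ranges the grid holds 143 configurations, which shows how one pre-training run could yield on the order of a hundred extractable model sizes.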
Taxonomy
Topics: Simulation Techniques and Applications · Model-Driven Software Engineering Techniques
Methods: Average Pooling · Linear Layer · Residual Connection · Multi-Head Attention · Stochastic Depth · Attention Is All You Need · Position-Wise Feed-Forward Layer · Kaiming Initialization · Adam · Byte Pair Encoding
