Weight subcloning: direct initialization of transformers using larger pretrained ones

Mohammad Samragh; Mehrdad Farajtabar; Sachin Mehta; Raviteja Vemulapalli; Fartash Faghri; Devang Naik; Oncel Tuzel; Mohammad Rastegari

arXiv:2312.09299·cs.LG·December 18, 2023·2 cites

TL;DR

This paper presents weight subcloning, a method to initialize smaller transformer models from larger pretrained ones, significantly speeding up training without needing a pretrained model of the exact size.

Contribution

We introduce weight subcloning, a novel technique for transferring knowledge from large to smaller transformers by dimension reduction and layer removal, enabling faster training.

Findings

01 Achieved 4x faster training for vision transformers.

02 Improved training speed for language models.

03 Effective transfer of knowledge across different model sizes.

Abstract

Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained model of the same size and specification to increase the convergence and training speed. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach, called weight subcloning, expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the…
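The two ideas named in the summary above, shrinking the embedding dimension by ranking neurons and removing layers, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract is truncated before it describes the procedure, so every detail here is an assumption. The toy "model" is a stack of square weight matrices, the importance score is summed absolute weight magnitude, the same kept indices are shared across all layers, and layers are kept at evenly spaced positions. The names `make_layer`, `neuron_importance`, and `subclone` are hypothetical.

```python
import random


def make_layer(d, rng):
    """Hypothetical toy layer: one d x d weight matrix (row = output neuron)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]


def neuron_importance(layers):
    """Assumed score: total absolute weight touching each embedding index,
    summed over every layer (as an output row and as an input column)."""
    d = len(layers[0])
    scores = [0.0] * d
    for w in layers:
        for i in range(d):
            for j in range(d):
                a = abs(w[i][j])
                scores[i] += a  # index i used as an output
                scores[j] += a  # index j used as an input
    return scores


def subclone(layers, d_small, n_layers_small):
    """Build a smaller stack by slicing weights out of the larger one."""
    scores = neuron_importance(layers)
    # Keep the d_small highest-scoring indices, restored to original order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = sorted(ranked[:d_small])
    # Assumed layer-removal rule: keep evenly spaced layers, first and last included.
    n = len(layers)
    if n_layers_small == 1:
        picked = [0]
    else:
        picked = [(k * (n - 1)) // (n_layers_small - 1) for k in range(n_layers_small)]
    # The same index set is applied to every kept layer (an assumption made here
    # so that all layers keep reading and writing the same embedding positions).
    small = [[[layers[l][i][j] for j in keep] for i in keep] for l in picked]
    return small, keep, picked


rng = random.Random(0)
big = [make_layer(8, rng) for _ in range(6)]   # 6 layers, embedding size 8
small, keep, picked = subclone(big, d_small=4, n_layers_small=3)
print(len(small), len(small[0]), len(small[0][0]))  # 3 4 4
```

The smaller stack would then be trained as usual; the sketch only covers the initialization step, and a real transformer would need the same index selection applied to embeddings, normalization parameters, attention heads, and MLP weights.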

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Digital Media Forensic Detection

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings