On Initializing Transformers with Pre-trained Embeddings

Ha Young Kim; Niranjan Balasubramanian; Byungkon Kang

arXiv:2407.12514·cs.CL·July 18, 2024

Open Access

TL;DR

This paper investigates why pre-trained embeddings sometimes underperform random initialization in transformer models, identifying factors like value distribution and interaction with position encodings, and proposes standardization to improve results.

Contribution

The paper shows how the value distribution of embeddings affects transformer training, and demonstrates that standardizing pre-trained embeddings makes them more effective as an initialization.
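As a rough illustration of what standardizing an embedding table could look like, the sketch below shifts all entries to zero mean and rescales them to the standard deviation of Xavier uniform initialization. The exact procedure used in the paper is not given on this page, so the global (whole-matrix) statistics, the choice of `vocab_size` and `dim` as fan-in and fan-out, and the synthetic `wide` table are all assumptions made for illustration.

```python
import math
import random
import statistics


def xavier_uniform_std(fan_in: int, fan_out: int) -> float:
    """Std of Xavier (Glorot) uniform init U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return bound / math.sqrt(3.0)  # the std of U(-a, a) is a / sqrt(3)


def standardize(matrix, target_std):
    """Shift all entries to zero mean, then rescale them to target_std.

    Assumption: statistics are computed over the whole matrix, not per row.
    """
    flat = [v for row in matrix for v in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    scale = target_std / std
    return [[(v - mean) * scale for v in row] for row in matrix]


# Hypothetical stand-in for a pre-trained table with a wide value distribution.
rng = random.Random(0)
vocab_size, dim = 200, 16
wide = [[rng.gauss(0.1, 2.0) for _ in range(dim)] for _ in range(vocab_size)]

# Assumption: treat (vocab_size, dim) as (fan_in, fan_out) for the embedding table.
target = xavier_uniform_std(vocab_size, dim)
narrow = standardize(wide, target)

flat_narrow = [v for row in narrow for v in row]
print(f"target std: {target:.4f}")
print(f"std after standardization: {statistics.pstdev(flat_narrow):.4f}")
```

Rescaling every entry by the same factor preserves the relative geometry of the vectors (directions and distance ratios), which is presumably why the pre-trained information can survive the change of scale.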

Findings

01 Standardizing embeddings improves the performance of GloVe, T5, and mT5.

02 BERT embeddings perform better because their value range is closer to that of Xavier initialization.

03 The interaction between embeddings and position encodings affects training stability.
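To make the scale argument concrete, the following sketch compares the norm of a position encoding with the norm of a token embedding at several embedding scales. It assumes sinusoidal position encodings as in the original Transformer and that embeddings and encodings are summed; this page does not say which encoding the paper uses, and the token vectors here are synthetic Gaussian samples, not real pre-trained embeddings.

```python
import math
import random


def sinusoidal_encoding(position: int, dim: int):
    """Sinusoidal position encoding for one position (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


rng = random.Random(0)
dim = 64
pos_norm = l2_norm(sinusoidal_encoding(position=5, dim=dim))

# Hypothetical embedding scales, from narrow to wide value distributions.
ratios = []
for std in (0.05, 1.0, 5.0):
    token = [rng.gauss(0.0, std) for _ in range(dim)]
    ratio = pos_norm / l2_norm(token)
    ratios.append(ratio)
    print(f"embedding std={std}: position/token norm ratio = {ratio:.2f}")
```

Each sine/cosine pair contributes exactly 1 to the squared norm, so the encoding norm is fixed at `sqrt(dim / 2)` while the token norm grows with the embedding scale. The wider the embedding distribution, the smaller the share of the summed input that carries position information.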

Abstract

It has become common practice to use random initialization schemes, rather than pre-trained embeddings, when training transformer-based models from scratch. Indeed, we find that pre-trained word embeddings from GloVe, and some sub-word embeddings extracted from language models such as T5 and mT5, fare much worse than random initialization. This is counter-intuitive given the well-known representational and transfer-learning advantages of pre-training. Interestingly, we also find that BERT and mBERT embeddings fare better than random initialization, showing the advantages of pre-trained representations. In this work, we posit two potential factors that contribute to these mixed results: the model's sensitivity to parameter distribution and the embeddings' interactions with position encodings. We observe that pre-trained GloVe, T5, and mT5 embeddings have a wider distribution of…

Taxonomy

Topics: Neural Networks and Applications

Methods: Gated Linear Unit · Attention Is All You Need · Byte Pair Encoding · SentencePiece · Residual Connection · Layer Normalization · Xavier Initialization · Linear Layer · Adafactor