Maximizing Efficiency of Language Model Pre-training for Learning Representation

Junmo Kang; Suwon Shin; Jeonghwan Kim; Jaeyoung Jo; Sung-Hyon Myaeng

arXiv:2110.06620 · cs.CL · October 14, 2021

TL;DR

This paper proposes an adaptive early exit strategy to enhance the compute efficiency of ELECTRA pre-training by utilizing earlier layer representations, and investigates the necessity of the generator module for maintaining accuracy.

Contribution

It introduces an adaptive early exit method for ELECTRA pre-training and evaluates the generator module's role in balancing efficiency and accuracy.

Findings

01. Early exit improves training efficiency without significant accuracy loss.

02. An initial approach that questions the necessity of ELECTRA's generator module shows promising compute efficiency but does not maintain the model's accuracy.

03. The approach achieves faster pre-training with comparable performance.

Abstract

In recent years, pre-trained language models have grown exponentially in both parameter count and compute time. ELECTRA is a novel approach for improving the compute efficiency of pre-trained language models (e.g. BERT) based on masked language modeling (MLM): it addresses the sample inefficiency problem with the replaced token detection (RTD) task. Our work proposes an adaptive early exit strategy that maximizes the efficiency of the pre-training process by leveraging earlier layer representations, relieving the model's subsequent layers of the need to process latent features. Moreover, by thoroughly investigating the necessity of ELECTRA's generator module, we evaluate an initial approach to the problem that shows promising compute efficiency but has not succeeded in maintaining the accuracy of the model.
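The two ideas named in the abstract can be illustrated with a minimal sketch in plain Python. The first function builds RTD targets: each position is labeled 1 if the token was replaced and 0 otherwise, which is how the RTD task is generally described for ELECTRA. The second function is only an assumption for illustration: it exits at the first layer whose confidence reaches a threshold, which may differ from the exit criterion the paper actually uses. The names `rtd_labels`, `first_confident_layer`, and `threshold` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def rtd_labels(original: Sequence[str], corrupted: Sequence[str]) -> List[int]:
    """Return 1 where the corrupted token differs from the original, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


def first_confident_layer(layer_confidences: Sequence[float], threshold: float) -> int:
    """Return the index of the first layer whose confidence meets the threshold.

    Assumption for illustration only: a fixed confidence threshold decides
    the exit. Falls back to the last layer when no layer is confident enough.
    """
    if not layer_confidences:
        raise ValueError("at least one layer confidence is required")
    for index, confidence in enumerate(layer_confidences):
        if confidence >= threshold:
            return index
    return len(layer_confidences) - 1


# One token ("cooked") has been replaced by a plausible alternative ("ate").
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
print(rtd_labels(original, corrupted))  # [0, 0, 1, 0, 0]

# Hypothetical per-layer confidences; the third layer (index 2) is the first
# to reach the 0.9 threshold, so later layers would be skipped.
print(first_confident_layer([0.55, 0.72, 0.93, 0.97], threshold=0.9))  # 2
```

Because every position receives a label in RTD, the model gets a training signal from all input tokens rather than only the masked subset, which is the sample-efficiency argument the abstract refers to.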

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Dropout · Layer Normalization · Attention Dropout · Linear Warmup With Linear Decay