Effective Pre-Training Objectives for Transformer-based Autoencoders

Luca Di Liello; Matteo Gabburo; Alessandro Moschitti

arXiv:2210.13536·cs.CL·October 26, 2022

TL;DR

This paper explores efficient pre-training objectives for Transformer encoders, proposing lighter alternatives to existing methods that reduce computational cost while maintaining performance.

Contribution

It introduces new pre-training approaches that combine features of common objectives, and it designs lightweight statistical token generators to replace computationally heavy ones such as the generator used by ELECTRA.

Findings

01. Light token generators significantly reduce pre-training cost.

02. Some alternative objectives are more efficient than BERT's MLM.

03. Pre-training with lighter generators does not cause a significant drop in performance.

Abstract

In this paper, we study trade-offs between efficiency, cost and accuracy when pre-training Transformer encoders with different pre-training objectives. For this purpose, we analyze features of common objectives and combine them to create new effective pre-training approaches. Specifically, we designed light token generators based on a straightforward statistical approach, which can replace ELECTRA's computationally heavy generators, thus substantially reducing cost. Our experiments also show that (i) there are more efficient alternatives to BERT's MLM, and (ii) it is possible to efficiently pre-train Transformer-based models using lighter generators without a significant drop in performance.
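The abstract does not spell out how the light token generators work, so the following is only an illustrative sketch of one possible "statistical" generator, not the authors' method. It assumes replacements are sampled from corpus unigram frequencies, and that the encoder is then trained to detect replaced tokens as in ELECTRA. The function names `build_unigram_sampler` and `corrupt`, and the 15% replacement rate, are hypothetical choices made for this example.

```python
import random
from collections import Counter


def build_unigram_sampler(corpus, rng):
    """Return a function that samples tokens by corpus frequency.

    Assumption: the "statistical" generator is a plain unigram model.
    No neural network is involved, so sampling is essentially free.
    """
    counts = Counter(tok for sentence in corpus for tok in sentence)
    vocab = list(counts)
    weights = [counts[tok] for tok in vocab]

    def sample():
        return rng.choices(vocab, weights=weights, k=1)[0]

    return sample


def corrupt(tokens, sample, rng, replace_prob=0.15):
    """Replace some tokens and build replaced-token-detection labels.

    A position is labeled 1 only if the sampled token differs from the
    original; sampling the original token back counts as "not replaced".
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            new_tok = sample()
            corrupted.append(new_tok)
            labels.append(int(new_tok != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


# Toy usage: the corrupted sequence and labels would be fed to the encoder,
# which learns a binary classifier over token positions.
rng = random.Random(0)
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
sample = build_unigram_sampler(corpus, rng)
corrupted, labels = corrupt(["the", "dog", "sat"], sample, rng, replace_prob=0.5)
print(corrupted, labels)
```

The design point this sketch tries to convey is the cost trade-off: a frequency table replaces a trained generator network, so the only model that needs forward and backward passes during pre-training is the encoder itself.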

Taxonomy

Topics: Neural Networks and Applications · Music and Audio Processing · Anomaly Detection Techniques and Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · WordPiece · Linear Warmup With Linear Decay