A Comprehensive Comparison of Pre-training Language Models

Tong Guo

arXiv:2106.11483·cs.CL·July 27, 2023

TL;DR

This paper compares various transformer-based pre-trained language models, finding that adding RNN layers offers limited benefits for short text understanding, and emphasizing the effectiveness of data-centric methods.

Contribution

It provides a systematic comparison of pre-trained models with controlled training conditions and highlights the limited gains from architectural modifications like RNN layers.

Findings

01. Adding RNN layers yields minimal improvement for short text understanding.

02. Data-centric methods outperform model architecture modifications.

03. No significant performance difference among similar BERT-based models.

Abstract

Recently, the development of pre-trained language models has brought natural language processing (NLP) tasks to a new state of the art. In this paper, we explore the efficiency of various pre-trained language models. We pre-train a list of transformer-based models with the same amount of text and the same number of training steps. The experimental results show that the largest improvement over the original BERT comes from adding an RNN layer to capture more contextual information for short text understanding. However, the overall conclusion is that similar BERT structures yield no remarkable improvement for short text understanding. A data-centric method [12] can achieve better performance.
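
To make the "RNN layer on top of the encoder" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not say which recurrent cell is used, so a simple tanh (Elman) cell is assumed, and the names `rnn_layer`, `random_matrix`, and `encoder_out`, the dimensions, and the random weights are all hypothetical placeholders. In practice the input would be the per-token output vectors of a pre-trained BERT-style encoder.

```python
import math
import random


def rnn_layer(hidden_states, w_in, w_rec, bias):
    """Run a simple tanh (Elman) RNN over a sequence of encoder output vectors.

    hidden_states: list of T vectors (each a list of d_in floats), standing in
                   for the per-token outputs of a transformer encoder.
    w_in:  d_out x d_in input weight matrix (list of rows).
    w_rec: d_out x d_out recurrent weight matrix (list of rows).
    bias:  list of d_out floats.
    Returns a list of T vectors, each of size d_out.
    """
    d_out = len(bias)
    state = [0.0] * d_out  # initial recurrent state
    outputs = []
    for x in hidden_states:
        new_state = []
        for i in range(d_out):
            total = bias[i]
            total += sum(w * v for w, v in zip(w_in[i], x))       # input term
            total += sum(w * s for w, s in zip(w_rec[i], state))  # recurrent term
            new_state.append(math.tanh(total))
        state = new_state
        outputs.append(state)
    return outputs


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper: a rows x cols matrix of small uniform random values."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


# Assumed toy sizes; real encoder outputs are much wider (e.g. hundreds of dims).
rng = random.Random(0)
seq_len, d_in, d_out = 5, 8, 4
encoder_out = random_matrix(seq_len, d_in, rng, scale=1.0)  # placeholder for encoder outputs
w_in = random_matrix(d_out, d_in, rng)
w_rec = random_matrix(d_out, d_out, rng)
bias = [0.0] * d_out

contextual = rnn_layer(encoder_out, w_in, w_rec, bias)
sentence_vector = contextual[-1]  # final state as a fixed-size summary of the text
```

Here `sentence_vector` could feed a classifier for a short-text task. Each output position depends on all earlier tokens through the recurrent state, which is the extra sequential context such a layer adds on top of self-attention.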

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Adam · Dropout · WordPiece · Layer Normalization · Multi-Head Attention · Linear Warmup With Linear Decay