A Comprehensive Comparison of Pre-training Language Models
Tong Guo

TL;DR
This paper compares various transformer-based pre-trained language models, finding that adding RNN layers offers limited benefits for short text understanding, and emphasizing the effectiveness of data-centric methods.
Contribution
It provides a systematic comparison of pre-trained models with controlled training conditions and highlights the limited gains from architectural modifications like RNN layers.
Findings
Adding RNN layers yields minimal improvement for short text understanding.
Data-centric methods outperform model architecture modifications.
No significant performance difference among similar BERT-based models.
Abstract
Recently, the development of pre-trained language models has brought natural language processing (NLP) tasks to a new state of the art. In this paper, we explore the efficiency of various pre-trained language models. We pre-train a set of transformer-based models on the same amount of text and for the same number of training steps. The experimental results show that the largest improvement over the original BERT comes from adding an RNN layer to capture more contextual information for short text understanding. However, the overall conclusion is that similar BERT structures yield no remarkable improvement for short text understanding. Data-centric methods [12] can achieve better performance.
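To make the "RNN layer on top of BERT" idea concrete, here is a minimal NumPy sketch of a bidirectional recurrent head applied to encoder outputs. It is an illustration only, not the paper's implementation: the abstract does not say which RNN cell is used or where it is placed, so the plain tanh cell, the choice of summary vector, and the names rnn_pass, rnn_head, and init_params are all assumptions. Random vectors stand in for the per-token hidden states a BERT-style encoder would produce.

```python
import numpy as np


def rnn_pass(states, w_in, w_rec, bias):
    """Run a simple tanh (Elman) RNN over a (seq_len, d_model) array.

    Returns the hidden state at every position, shape (seq_len, d_rnn).
    """
    h = np.zeros(w_rec.shape[0])
    outputs = []
    for x in states:
        h = np.tanh(x @ w_in + h @ w_rec + bias)
        outputs.append(h)
    return np.stack(outputs)


def rnn_head(encoder_states, params):
    """Hypothetical classification head: bidirectional RNN, then a linear layer."""
    fwd = rnn_pass(encoder_states, *params["fwd"])
    # Read the sequence right-to-left, then flip back to align with positions.
    bwd = rnn_pass(encoder_states[::-1], *params["bwd"])[::-1]
    # Each direction's last-computed state has seen the whole sequence:
    # that is fwd[-1] for the forward pass and bwd[0] for the backward pass.
    summary = np.concatenate([fwd[-1], bwd[0]])
    return summary @ params["w_out"] + params["b_out"]


def init_params(d_model, d_rnn, n_classes, seed=0):
    """Small random weights; in practice these would be trained with the encoder."""
    rng = np.random.default_rng(seed)

    def direction():
        return (
            rng.normal(0.0, 0.1, (d_model, d_rnn)),
            rng.normal(0.0, 0.1, (d_rnn, d_rnn)),
            np.zeros(d_rnn),
        )

    return {
        "fwd": direction(),
        "bwd": direction(),
        "w_out": rng.normal(0.0, 0.1, (2 * d_rnn, n_classes)),
        "b_out": np.zeros(n_classes),
    }


# Stand-in for encoder output: 8 tokens, hidden size 16 (illustrative sizes only).
encoder_states = np.random.default_rng(1).normal(size=(8, 16))
params = init_params(d_model=16, d_rnn=12, n_classes=3)
logits = rnn_head(encoder_states, params)
print(logits.shape)
```

In a real setup the recurrent weights would be trained jointly with the encoder during fine-tuning. The sketch only shows the data flow: per-token encoder states go in, a fixed-size summary comes out, and a linear layer maps it to class logits.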
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Adam · Dropout · WordPiece · Layer Normalization · Multi-Head Attention · Linear Warmup With Linear Decay
