On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets
Cheng-Han Chiang, Hung-yi Lee

TL;DR
This study investigates which traits in artificial pre-training datasets contribute to the transferability of language models, revealing that explicit and implicit token dependencies are crucial for downstream performance.
Contribution
The paper introduces a controlled experimental framework using artificial datasets to isolate and analyze the impact of specific data traits on language model transferability.
Findings
Explicit token dependencies are critical for downstream performance.
Longer implicit dependencies improve transferability.
Pre-trained models can transfer knowledge even without natural language data.
Abstract
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for a model to achieve exceptional downstream performance than for a counterpart trained directly on the downstream tasks. In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks. We propose to use artificially constructed datasets as the pre-training data to exclude the effect of semantics, and to further control what characteristics the pre-training corpora have. By fine-tuning the pre-trained models on the GLUE benchmark, we can learn how beneficial it is to transfer knowledge from a model trained on a dataset possessing a specific trait. We define and discuss three different characteristics in the artificial dataset: 1) matching the token's uni-gram or bi-gram…
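As a rough illustration of the first characteristic named in the abstract (matching a token uni-gram or bi-gram distribution), the Python sketch below builds an artificial token sequence whose uni-gram or bi-gram statistics follow those of a reference token list. It is not the authors' procedure: the function names `unigram_sampler` and `bigram_sampler`, the toy reference sequence, and the restart rule for tokens with no observed successor are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def unigram_sampler(tokens, rng):
    """Return a function that draws tokens i.i.d. from the empirical uni-gram distribution."""
    counts = Counter(tokens)
    vocab = sorted(counts)
    weights = [counts[t] for t in vocab]

    def sample(n):
        if n <= 0:
            return []
        return rng.choices(vocab, weights=weights, k=n)

    return sample


def bigram_sampler(tokens, rng):
    """Return a function that draws each token conditioned on the previous one."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    unigram = unigram_sampler(tokens, rng)

    def sample(n):
        if n <= 0:
            return []
        out = unigram(1)
        while len(out) < n:
            successors = following.get(out[-1])
            if not successors:
                # Assumption: restart from the uni-gram distribution when the
                # previous token was never followed by anything in the reference.
                out.extend(unigram(1))
                continue
            candidates = sorted(successors)
            weights = [successors[c] for c in candidates]
            out.append(rng.choices(candidates, weights=weights, k=1)[0])
        return out

    return sample


# Toy reference sequence (hypothetical); a real study would use token IDs from a corpus.
reference = "a b a b a c".split()
rng = random.Random(0)
unigram_corpus = unigram_sampler(reference, rng)(50)
bigram_corpus = bigram_sampler(reference, rng)(50)
```

The uni-gram sampler preserves only token frequencies, while the bi-gram sampler also preserves which token tends to follow which. Neither carries any meaning, which is the sense in which such data excludes the effect of semantics.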
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
