On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets

Cheng-Han Chiang; Hung-yi Lee

arXiv:2109.03537 · cs.CL · February 21, 2022

TL;DR

This study investigates which traits in artificial pre-training datasets contribute to the transferability of language models, revealing that explicit and implicit token dependencies are crucial for downstream performance.

Contribution

The paper introduces a controlled experimental framework using artificial datasets to isolate and analyze the impact of specific data traits on language model transferability.

Findings

01

Explicit token dependencies are critical for downstream performance.

02

Longer implicit dependencies improve transferability.

03

Pre-trained models can transfer knowledge even without natural language data.

Abstract

Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for them to achieve exceptional downstream performance than for counterparts trained directly on the downstream tasks. In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks. We propose to use artificially constructed datasets as the pre-training data to exclude the effect of semantics, and to further control what characteristics the pre-training corpora have. By fine-tuning the pre-trained models on the GLUE benchmark, we can learn how beneficial it is to transfer knowledge from a model trained on a dataset possessing a specific trait. We define and discuss three different characteristics in the artificial dataset: 1) matching the token's uni-gram or bi-gram…
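
To make the first characteristic more concrete, the sketch below shows one way an artificial token sequence could be sampled so that it follows the uni-gram or bi-gram statistics of a reference sequence. It is an illustration only, not the authors' construction: the abstract is truncated here, and the function names `unigram_corpus` and `bigram_corpus`, the integer token IDs, and the restart rule for tokens with no observed successor are all assumptions.

```python
import random
from collections import Counter, defaultdict


def unigram_corpus(reference, n_tokens, seed=0):
    """Sample tokens independently from the reference's uni-gram distribution."""
    rng = random.Random(seed)
    counts = Counter(reference)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n_tokens)


def bigram_corpus(reference, n_tokens, seed=0):
    """Sample a first-order Markov chain whose transitions follow the
    reference's bi-gram counts (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    successors = defaultdict(Counter)
    for prev, nxt in zip(reference, reference[1:]):
        successors[prev][nxt] += 1

    current = rng.choice(reference)
    out = [current]
    while len(out) < n_tokens:
        options = successors.get(current)
        if not options:
            # Assumption: a token with no observed successor restarts the chain.
            current = rng.choice(reference)
        else:
            candidates = list(options)
            weights = [options[t] for t in candidates]
            current = rng.choices(candidates, weights=weights)[0]
        out.append(current)
    return out


# Toy reference sequence of integer token IDs (made up for this example).
reference = [1, 2, 3, 1, 2, 4, 1, 2, 3, 5]
uni = unigram_corpus(reference, 1000)
bi = bigram_corpus(reference, 1000)
```

In the toy reference, token 1 is always followed by token 2, so the bi-gram sample preserves that dependency while the uni-gram sample only preserves token frequencies. Comparing models pre-trained on the two kinds of corpus is the sort of controlled contrast the abstract describes.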

Code & Models

Repositories

d223302/Transformer-Structure
PyTorch · Official

Videos

On the Transferability of Pre-Trained Language Models: A Study from Artificial Datasets · Underline

Taxonomy

Topics

Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification