Studying the impacts of pre-training using ChatGPT-generated text on downstream tasks

Sarthak Anand

arXiv:2309.05668·cs.CL·September 13, 2023

Open Access

TL;DR

This study investigates whether using ChatGPT-generated text in pre-training affects language model performance and bias, finding no significant impact on downstream task results or gender bias.

Contribution

It provides the first empirical analysis of the effects of artificial, LLM-generated text in pre-training on downstream task performance and bias.

Findings

01

Artificial text in pre-training does not significantly affect downstream task performance.

02

Pre-training with ChatGPT-generated text does not increase gender bias.

03

Results suggest robustness of models to training data source variations.

Abstract

In recent times, significant advancements have been witnessed in the field of language models, particularly with the emergence of Large Language Models (LLMs) that are trained on vast amounts of data extracted from internet archives. These LLMs, such as ChatGPT, have become widely accessible, allowing users to generate text for various purposes including articles, essays, jokes, and poetry. Given that LLMs are trained on a diverse range of text sources, encompassing platforms like Reddit and Twitter, it is foreseeable that future training datasets will also incorporate text generated by previous iterations of the models themselves. In light of this development, our research aims to investigate the influence of artificial text in the pre-training phase of language models. Specifically, we conducted a comparative analysis between a language model, RoBERTa, pre-trained using CNN/DailyMail…

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Adam · Weight Decay · Softmax · Linear Warmup With Linear Decay