Studying the impacts of pre-training using ChatGPT-generated text on downstream tasks
Sarthak Anand

TL;DR
This study investigates whether using ChatGPT-generated text in pre-training affects language model performance and bias, finding no significant impact on downstream task results or gender bias.
Contribution
It provides the first empirical analysis of the effects of artificial, LLM-generated text in pre-training on downstream task performance and bias.
Findings
Artificial text in pre-training does not significantly affect downstream task performance.
Pre-training with ChatGPT-generated text does not increase gender bias.
Results suggest robustness of models to training data source variations.
Abstract
In recent times, significant advancements have been witnessed in the field of language models, particularly with the emergence of Large Language Models (LLMs) that are trained on vast amounts of data extracted from internet archives. These LLMs, such as ChatGPT, have become widely accessible, allowing users to generate text for various purposes including articles, essays, jokes, and poetry. Given that LLMs are trained on a diverse range of text sources, encompassing platforms like Reddit and Twitter, it is foreseeable that future training datasets will also incorporate text generated by previous iterations of the models themselves. In light of this development, our research aims to investigate the influence of artificial text in the pre-training phase of language models. Specifically, we conducted a comparative analysis between a language model, RoBERTa, pre-trained using CNN/DailyMail…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Adam · Weight Decay · Softmax · Refunds@Expedia|||How do I get a full refund from Expedia? · Linear Warmup With Linear Decay
