Regurgitative Training: The Value of Real Data in Training Large Language Models

Jinghui Zhang; Dandan Qiao; Mochen Yang; Qiang Wei

arXiv:2407.12835·cs.CL·July 26, 2024·3 cites

TL;DR

Training large language models on data generated by other LLMs significantly degrades their performance, which underscores the importance of real, human-generated data for effective model training.

Contribution

This study systematically evaluates the impact of regurgitative training on LLM performance and proposes mitigation strategies to address the associated challenges.

Findings

01

Regurgitative training reduces LLM performance across tasks.

02

Higher error rates and lower lexical diversity in LLM-generated data appear to drive the performance decline.

03

Mitigation strategies reduce the performance loss but do not fully recover it.
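To make the lexical diversity finding concrete, here is a minimal Python sketch of two common diversity proxies: type-token ratio and distinct-n. This summary does not say which metric the paper uses, so the function names, the whitespace tokenization, and the toy sentences are all assumptions made for illustration.

```python
def type_token_ratio(texts):
    """Unique tokens divided by total tokens (lowercased, whitespace-split).

    Illustrative proxy only; not necessarily the paper's diversity metric.
    """
    tokens = [tok.lower() for text in texts for tok in text.split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams across all texts."""
    ngrams = []
    for text in texts:
        toks = text.lower().split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical toy corpora, invented for this example.
human_texts = ["the cat sat on the mat", "a dog slept by the door"]
generated_texts = ["the cat sat on the mat", "the cat sat on the mat"]

print("TTR human:    ", type_token_ratio(human_texts))
print("TTR generated:", type_token_ratio(generated_texts))
print("distinct-2 human:    ", distinct_n(human_texts, n=2))
print("distinct-2 generated:", distinct_n(generated_texts, n=2))
```

On corpora of different sizes, type-token ratio is sensitive to length, so a comparison like this is usually made on samples with a similar number of tokens.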

Abstract

What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. We evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The same performance loss of regurgitative training is observed on transformer models that we train from scratch. We find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Residual Connection · Adam · Dropout · Byte Pair Encoding · Cosine Annealing · Layer Normalization · Linear Layer