Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models

Jie Chen; Yupeng Zhang; Bingning Wang; Wayne Xin Zhao; Ji-Rong Wen; Weipeng Chen

arXiv:2406.12397·cs.CL·June 19, 2024

TL;DR

This paper investigates the inherent flaws in synthetic data used for training large language models, particularly pattern overfitting in question-answer pairs, and proposes an unlearning-based mitigation method that improves instruction-following capabilities without sacrificing benchmark performance.

Contribution

The paper identifies specific flaws in synthetic data, especially pattern overfitting on Q-A pairs, and introduces an unlearning technique to mitigate them.

Findings

01. Unlearning techniques can reverse instruction-following issues caused by synthetic data flaws.

02. The proposed method improves robustness of LLMs without performance loss on benchmarks.

03. Synthetic data flaws can be mitigated at relatively low computational cost.

Abstract

Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs). Studies have shown that synthetic data can effectively improve the performance of LLMs on downstream benchmarks. However, despite its potential benefits, our analysis suggests that there may be inherent flaws in synthetic data. The uniform format of synthetic data can lead to pattern overfitting and cause significant shifts in the output distribution, thereby reducing the model's instruction-following capabilities. Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. The empirical results demonstrate the effectiveness of our approach, which can reverse the instruction-following issues caused by…
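To make the idea of unlearning concrete, the sketch below shows one generic formulation: minimise the loss on data whose behaviour should be kept while pushing the loss up on the data whose pattern should be forgotten. This is an illustration only and is not taken from the paper; the function names, the `forget_weight` parameter, and the `max_forget_nll` cap are all assumptions introduced here, and the paper's actual loss may differ.

```python
import math


def mean_nll(token_probs):
    """Average negative log-likelihood the model assigns to the target tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def unlearning_objective(forget_probs, retain_probs,
                         forget_weight=1.0, max_forget_nll=5.0):
    """Hypothetical objective to minimise during unlearning.

    retain_probs: probabilities of target tokens on data to preserve.
    forget_probs: probabilities of target tokens on the synthetic Q-A
                  data whose format the model has overfitted.
    The forget term enters with a negative sign (gradient ascent on that
    data) and is capped at max_forget_nll, an assumed bound that stops
    the term from growing without limit.
    """
    forget_nll = min(mean_nll(forget_probs), max_forget_nll)
    retain_nll = mean_nll(retain_probs)
    return retain_nll - forget_weight * forget_nll


# Equal likelihood on both sets: the two terms cancel.
balanced = unlearning_objective([0.5, 0.5], [0.5, 0.5])

# Lower likelihood on the forget set lowers the objective.
after_forgetting = unlearning_objective([0.1, 0.1], [0.5, 0.5])

# Once the forget NLL passes the cap, the objective stops decreasing.
capped = unlearning_objective([1e-6, 1e-6], [0.5, 0.5])
```

In a real training loop the probabilities would come from the model's output on each batch, and the cap is one simple way to keep the ascent term from dominating the retain term.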

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques