Does Training on Synthetic Data Make Models Less Robust?

Lingze Zhang; Ellie Pavlick

arXiv:2502.07164 · cs.CL · March 18, 2025

TL;DR

This study investigates whether training large language models with synthetic data worsens their robustness, finding that synthetic data neither exacerbates nor alleviates known heuristic blindspots in natural language inference tasks.

Contribution

The paper provides empirical evidence on how training LLMs with synthetic data affects robustness, specifically showing that synthetic data neither reinforces nor reduces heuristic blindspots in NLI tasks.

Findings

01 Synthetic data does not reinforce existing blindspots.

02 Fine-tuning with synthetic data does not worsen heuristic reliance.

03 Synthetic data neither improves nor harms model robustness in the tested scenario.

Abstract

An increasingly common practice is to train large language models (LLMs) using synthetic data. Often this synthetic data is produced by the same or similar LLMs as those it is being used to train. This raises the question of whether the synthetic data might in fact exacerbate certain "blindspots" by reinforcing heuristics that the LLM already encodes. In this paper, we conduct simulated experiments on the natural language inference (NLI) task with Llama-2-7B-hf models. We use MultiNLI as the general task and HANS, a targeted evaluation set designed to measure the presence of specific heuristic strategies for NLI, as our "blindspot" task. Our goal is to determine whether performance disparities between the general and blind spot tasks emerge. Our results indicate that synthetic data does not reinforce blindspots in the way we expected. Specifically, we see that, while fine-tuning with…
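The comparison the abstract describes, general-task performance versus "blindspot"-task performance, can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the label strings, and the use of a simple accuracy difference as the "disparity" are all assumptions made for the example. It does rely on one known property of the datasets: MultiNLI uses three labels, while HANS uses two (entailment vs. non-entailment), so three-way predictions are collapsed before scoring on HANS.

```python
from typing import Sequence

# Hypothetical label strings chosen for this sketch.
ENTAILMENT = "entailment"
NON_ENTAILMENT = "non-entailment"


def collapse_to_binary(label: str) -> str:
    """Map a three-way NLI label onto the two-way HANS label space."""
    return ENTAILMENT if label == ENTAILMENT else NON_ENTAILMENT


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def robustness_gap(
    general_preds: Sequence[str],
    general_gold: Sequence[str],
    blindspot_preds: Sequence[str],
    blindspot_gold: Sequence[str],
) -> float:
    """General-task accuracy minus blindspot-task accuracy.

    A gap that widens after fine-tuning would suggest the model is
    leaning harder on heuristics (an assumed metric, for illustration).
    """
    general_acc = accuracy(general_preds, general_gold)
    # Collapse three-way predictions before comparing to two-way gold labels.
    binary_preds = [collapse_to_binary(p) for p in blindspot_preds]
    blindspot_acc = accuracy(binary_preds, blindspot_gold)
    return general_acc - blindspot_acc
```

Under these assumptions, one would compute the gap for a model fine-tuned on original data and for one fine-tuned on synthetic data, then compare the two values.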

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Topic Modeling

Methods: Sparse Evolutionary Training