A Little Human Data Goes A Long Way

Dhananjay Ashok; Jonathan May

arXiv:2410.13098 · cs.CL · August 21, 2025

TL;DR

This study examines how far synthetic data can replace human annotation in Fact Verification and Question Answering. Replacing up to 90% of the training data with synthetic points only marginally decreases performance, while adding a small amount of human data to a purely synthetic training set improves results cost-effectively.

Contribution

The paper demonstrates that a small amount of human data substantially improves models trained on synthetic data, and it offers insight into cost-effective data annotation strategies.

Findings

01

Replacing up to 90% of the training data with synthetic points only marginally decreases performance, but replacing the final 10% leads to severe declines.

02

Including as few as 125 human-generated data points reliably improves models trained on purely synthetic data.

03

Matching the performance gain from just 200 additional human data points requires an order of magnitude more synthetic data, so a little human data can be the more cost-effective choice.

Abstract

Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data (only 200 points) requires an order of magnitude more synthetic data and…
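The incremental replacement described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `mix_training_set`, its parameters, and the toy data are all hypothetical, and the paper's actual sampling procedure may differ. The sketch only shows one plausible way to build a fixed-size training set in which a chosen fraction of points is synthetic.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_training_set(
    human: Sequence[T],
    synthetic: Sequence[T],
    synthetic_fraction: float,
    total: int,
    seed: int = 0,
) -> List[T]:
    """Build a training set of `total` points with the given synthetic share.

    Hypothetical helper for illustration; not taken from the paper's repository.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")

    # Number of synthetic points; the remainder is filled with human points.
    n_synthetic = round(total * synthetic_fraction)
    n_human = total - n_synthetic
    if n_human > len(human) or n_synthetic > len(synthetic):
        raise ValueError("not enough points in one of the pools")

    # Sample without replacement from each pool, then shuffle the mixture.
    rng = random.Random(seed)
    mixed = rng.sample(list(human), n_human) + rng.sample(list(synthetic), n_synthetic)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy pools: each point is tagged with its source so the mix can be inspected.
    human_pool = [("human", i) for i in range(1000)]
    synthetic_pool = [("synthetic", i) for i in range(1000)]

    for fraction in (0.0, 0.5, 0.9, 1.0):
        train = mix_training_set(human_pool, synthetic_pool, fraction, total=1000)
        n_human = sum(1 for source, _ in train if source == "human")
        print(f"synthetic_fraction={fraction}: {n_human} human, {len(train) - n_human} synthetic")
```

Sweeping `synthetic_fraction` from 0.0 to 1.0 while holding `total` fixed mirrors the kind of comparison the abstract reports, where the interesting regime is the last step from 90% to 100% synthetic.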

Code & Models

Repositories

dhananjayashok/littlehumandata
PyTorch · Official

Videos

A Little Human Data Goes A Long Way · underline

Taxonomy

Topics: Context-Aware Activity Recognition Systems