A Little Human Data Goes A Long Way
Dhananjay Ashok, Jonathan May

TL;DR
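This study examines how far synthetic data can replace human annotation in Fact Verification (FV) and Question Answering (QA). Replacing up to 90% of the human training data with synthetic points only marginally decreases performance, but replacing the final 10% causes severe declines, and adding a small amount of human data to a synthetic training set is a cost-effective way to improve results.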
Contribution
The paper shows that adding a small amount of human data substantially improves models trained on synthetic data, which informs cost-effective data annotation strategies.
Findings
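Replacing up to 90% of the human training data with synthetic points only marginally decreases performance, while replacing the final 10% leads to severe declines.
Including as few as 125 human-generated data points reliably improves models trained on purely synthetic data.
Matching the gain from just 200 additional human data points requires an order of magnitude more synthetic data, so a small amount of human annotation can be the more cost-effective choice.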
Abstract
Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data (only 200 points) requires an order of magnitude more synthetic data and…
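The incremental replacement described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `mix_training_set`, its parameters, and the choice to hold the total training set size fixed while varying the synthetic share are all assumptions made for illustration.

```python
import random


def mix_training_set(human, synthetic, synthetic_fraction, seed=0):
    """Build a training set the same size as `human` in which
    `synthetic_fraction` of the points come from `synthetic`.

    Hypothetical helper: the fixed total size and uniform random
    sampling are assumptions, not details taken from the paper.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")

    total = len(human)
    n_synthetic = round(total * synthetic_fraction)
    if n_synthetic > len(synthetic):
        raise ValueError("not enough synthetic examples for this fraction")

    rng = random.Random(seed)
    # Keep a random subset of human points and fill the rest with synthetic ones.
    kept_human = rng.sample(human, total - n_synthetic)
    added_synthetic = rng.sample(synthetic, n_synthetic)

    mixed = kept_human + added_synthetic
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy stand-ins for annotated examples; real data would be FV or QA items.
    human = [("human", i) for i in range(1000)]
    synthetic = [("synthetic", i) for i in range(1000)]

    for fraction in (0.0, 0.5, 0.9, 1.0):
        mixed = mix_training_set(human, synthetic, fraction)
        n_syn = sum(1 for source, _ in mixed if source == "synthetic")
        print(f"fraction={fraction}: {n_syn} synthetic of {len(mixed)} points")
```

Training one model per fraction and comparing their evaluation scores would reproduce the shape of the experiment, under the assumptions stated above.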
Taxonomy
Topics: Context-Aware Activity Recognition Systems
