Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction
Martin Josifoski, Marija Sakota, Maxime Peyrard, Robert West

TL;DR
This paper introduces SynthIE, a method that exploits task asymmetry with LLMs to generate high-quality synthetic data for complex information extraction tasks, significantly improving small model performance.
Contribution
The paper presents a novel approach to generate synthetic training data for structured tasks by reversing the task prompt, enabling high-quality data creation for complex problems.
Findings
Generated 1.8M data points with high quality
SynthIE outperforms previous models by 57 points in micro-F1
SynthIE improves macro-F1 by 79 points
Abstract
Large language models (LLMs) have great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by LLMs: for problems with structured outputs, it is possible to prompt an LLM to perform the task in the reverse direction, by generating plausible input text for a target output structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging, and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation, and use it to finetune small models (220M and 770M parameters), termed…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
