Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Alisia Lupidi; Carlos Gemmell; Nicola Cancedda; Jane Dwivedi-Yu; Jason Weston; Jakob Foerster; Roberta Raileanu; Maria Lomeli

arXiv:2409.08239·cs.CL·August 21, 2025

TL;DR

Source2Synth is a scalable method for generating high-quality synthetic data grounded in real data sources, improving large language model performance on reasoning tasks by filtering low-quality generations.

Contribution

It introduces a novel approach that produces and curates synthetic data with reasoning steps, enhancing dataset quality for complex question answering tasks.

Findings

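01. Improves tabular question answering (TQA) performance by 25.51% on WikiSQL over the fine-tuned baseline.

02. Improves multi-hop question answering (MHQA) performance by 22.57% on HotpotQA over the fine-tuned baseline.

03. Filters out low-quality synthetic examples based on their answerability.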

Abstract

Synthetic data generation has recently emerged as a promising approach for enhancing the capabilities of large language models (LLMs) without the need for expensive human annotations. However, existing methods often generate data that can be low quality or contrived. In this paper, we introduce Source2Synth, a scalable approach for synthetic data generation and curation that is grounded in real-world data sources. Source2Synth takes as input a custom data source and produces synthetic data examples with intermediate reasoning steps. Our method improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two tasks that leverage two different types of data: multi-hop question answering (MHQA), where we test complex reasoning abilities leveraging documents, and tabular question answering…
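
The curation idea mentioned in the abstract (discarding generations based on answerability) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Example` type, the half-and-half split, the `max_tries` parameter, the normalized exact-match check, and the `answer_fn` callable (standing in for a model fine-tuned on the first slice) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    """A synthetic question/answer pair (hypothetical schema)."""
    question: str
    answer: str


def _normalize(text: str) -> str:
    # Assumption: answers are compared after trimming and lowercasing.
    return text.strip().lower()


def split_in_half(examples: List[Example]) -> Tuple[List[Example], List[Example]]:
    """Split synthetic examples into two slices (the split rule is an assumption)."""
    mid = len(examples) // 2
    return examples[:mid], examples[mid:]


def curate(
    examples: List[Example],
    answer_fn: Callable[[str], str],
    max_tries: int = 3,
) -> List[Example]:
    """Keep only examples that answer_fn answers correctly within max_tries attempts."""
    kept = []
    for ex in examples:
        target = _normalize(ex.answer)
        # An example counts as answerable if any attempt matches its answer.
        if any(_normalize(answer_fn(ex.question)) == target for _ in range(max_tries)):
            kept.append(ex)
    return kept
```

In use, the first slice returned by `split_in_half` would be used to fine-tune a model, and that model's answering function would be passed as `answer_fn` to `curate` on the second slice. With a sampling model, `max_tries` gives each example several chances before it is discarded.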

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

• It is novel and interesting to use a model trained on one part of the data to filter the generated data in the other half.
• The proposed pipeline brings a significant improvement in performance, and is even competitive with fine-tuned baselines.

Weaknesses

• The paper could use a much clearer presentation (grammar, wording, formatting, etc.).
• Consider discussing how (and whether) the proposed pipeline avoids data leakage during the data generation step, especially given that HotpotQA is also constructed from Wikipedia articles.
• For Tables 1 and 2, there is no comparison between LLMSynth (synthetic dataset only) and LLMCurated (synthetic dataset only). Adding such comparisons would further establish the importance of the curation step.
• The scope of thi

Reviewer 02 · Rating 3 · Confidence 4

Strengths

* The paper proposes a nice idea for automatically generating compositional queries. It also proposes techniques to automatically filter them using an LLM tuned on the synthesized data.
* The idea is described neatly and baseline performances are evaluated.

Weaknesses

1. The experimental evaluation steps lack clarity. Many important details are missing, which significantly affects the quality of the work. Please see questions Q1-Q6.
2. The method is applied in a somewhat limited context, on two problems, and it is not clear how many source examples the synthetic examples were generated from. It seems much smaller than the original dataset, e.g. HotpotQA already has 113k questions, but does not seem like the synthesized dataset ma

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The idea of using part of the synthesized dataset to fine-tune an LLM for filtering is interesting.

Weaknesses

The paper is not well written and is hard to follow, and some important details are missing. For example, it is not clear how the two slices of data are selected, where one is used to train an LLM to curate the other slice. The generation relies on the existence of a data source and the ability to get seed data, which limits its applicability, particularly in situations where the domain and tasks are new. The method uses an LLM to do data curation, which implicitly assumes the LLM already has reasonable
