DataGen: Unified Synthetic Dataset Generation via Large Language Models
Yue Huang, Siyuan Wu, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Jianfeng Gao, Chaowei Xiao, Lichao Sun, Xiangliang Zhang

TL;DR
DataGen is a versatile framework leveraging large language models to generate diverse, accurate, and controllable synthetic datasets, improving benchmarking and data augmentation for AI systems.
Contribution
The paper introduces DataGen, an LLM-powered framework for synthetic data generation that adds mechanisms for diversity, accuracy, and user control, addressing key challenges in the field.
Findings
DataGen produces high-quality, diverse datasets validated through mathematical and factual assessments.
Application of DataGen improves LLM benchmarking and data augmentation effectiveness.
Modules within DataGen significantly enhance data quality and task-specific customization.
Abstract
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents DataGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. DataGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented…
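The abstract mentions a code-based mathematical assessment for label verification. The sketch below illustrates one way such a check could work: a generated item carries an arithmetic expression and a claimed label, and the label is accepted only if evaluating the expression reproduces it. This is an assumption-laden illustration, not DataGen's implementation; `safe_eval` and `label_is_consistent` are hypothetical names, and the restriction to basic arithmetic is a simplification chosen here.

```python
import ast
import operator

# Whitelisted arithmetic operators (an assumption of this sketch).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # Accept numeric literals only (bool is excluded on purpose).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        # Names, calls, attributes, etc. are rejected.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))


def label_is_consistent(expr: str, claimed_label: float, tol: float = 1e-9) -> bool:
    """Return True if the computed value matches the claimed label within tol."""
    return abs(safe_eval(expr) - claimed_label) <= tol
```

For example, `label_is_consistent("3 * (4 + 5)", 27)` returns `True`, while a claimed label of `28` for the same expression is rejected. Walking the syntax tree instead of calling `eval` keeps the check from executing arbitrary generated code.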
Peer Reviews
Decision: ICLR 2025 Poster
- There is comprehensive work on each of the target attributes (generalization, controllability, diversity, and truthfulness).
- The methodology is highly detailed, including comprehensive ablations, evaluations, and cost details.
- The details about which synthetic generations other LLMs perform well and poorly on are helpful for further work on synthetic benchmarks.
- There could be more side-by-side comparisons of questions from the original dataset and each generated dataset.
**Novelty and Significance**. The paper presents a novel technique and artifact for the field of synthetic data generation. DataGen is generalizable to other domains and tasks, though with additional overhead. Compared to other related work, DataGen is able to cover a wide range of features in real settings. The artifact is available and runnable. **Writing**. The writing is clear and well-organized, with clear visuals and tables that summarize the comparison, methodology, evaluation, and ablation
**Data formatting.** In Section 3.5 (error analysis), the paper mentions that the dataset generation sometimes struggles to follow instructions or format the data correctly. With constrained decoding and similar techniques, this is very much a solved problem, though such techniques may produce results that the LLM itself would not follow (hence potentially lowering response quality). I recommend checking out related work in this field (e.g. Guidance[1], AICI[2], LMQL[3], etc.) to improve the data formatting issue. Furthermore, LLM eng
1. DataGen introduces novel elements like attribute-guided generation and RAG-based validation, which distinguish it from existing synthetic dataset generation frameworks.
2. The modular design allows for customization and adaptability across diverse datasets.
3. The experiments showing improved performance on reasoning and agent-oriented tasks demonstrate the potential of this data generation framework.
1. RAG-based validation is very costly (raising the cost from 0.038 to 0.19, almost a 5x increase). However, it is unclear how it affects the final data generation quality (like the results in Table 7). In other words, it would be better to ablate the modules in terms of the metrics in Table 7, instead of the current reports in Table 4.
2. I am not convinced that the performance decline on GSM8K in your experiments supports the conclusion that many LLMs may be overstated and overfit on the GSM8K dataset,
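One review above raises the data formatting issue and points to constrained decoding as a remedy. A lighter-weight alternative is to validate each raw generation after the fact and discard malformed items. The sketch below shows such post-hoc format validation only; it is not constrained decoding and not part of DataGen. The schema in `REQUIRED_FIELDS` and the function names are hypothetical.

```python
import json

# Hypothetical schema: every generated item must be a JSON object with these
# string fields. A real dataset would define its own fields.
REQUIRED_FIELDS = {"question": str, "answer": str}


def parse_item(raw: str):
    """Return the parsed item if it is well-formed, otherwise None."""
    try:
        item = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(item, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(item.get(field), expected_type):
            return None
    return item


def filter_well_formed(raw_outputs):
    """Split raw model outputs into parsed items and rejected raw strings."""
    kept, rejected = [], []
    for raw in raw_outputs:
        item = parse_item(raw)
        if item is None:
            rejected.append(raw)
        else:
            kept.append(item)
    return kept, rejected
```

Rejected strings can be logged or sent back for regeneration. Unlike constrained decoding, this approach does not alter what the model produces; it only filters the results.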
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer
