GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

Zihong Chen; Wanli Jiang; Jinzhe Li; Zhonghang Yuan; Huanjun Kong; Wanli Ouyang; Nanqing Dong

arXiv:2505.20416·cs.CL·May 28, 2025

TL;DR

GraphGen is a knowledge graph-guided framework that generates high-quality, diverse synthetic QA data targeting long-tail knowledge to improve supervised fine-tuning of large language models.

Contribution

The paper introduces a knowledge graph-based approach that uses multi-hop sampling and style control to improve the quality of synthetic data for LLM fine-tuning.
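To make "multi-hop sampling" concrete, here is a minimal sketch of collecting the k-hop neighborhood of an entity in a knowledge graph with breadth-first search. The function name `k_hop_neighborhood`, the adjacency-dict representation, and the toy graph are assumptions made for illustration; this page does not describe GraphGen's actual sampling procedure, which may differ.

```python
from collections import deque


def k_hop_neighborhood(adjacency, start, k):
    """Return {node: hop_distance} for every node within k hops of start.

    adjacency is a hypothetical dict mapping each entity to its neighbors.
    """
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = distances[node]
        if depth == k:
            # Do not expand past the hop limit.
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = depth + 1
                queue.append(neighbor)
    return distances


# Toy chain graph: A - B - C - D
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
two_hop = k_hop_neighborhood(toy_graph, "A", 2)
```

In a pipeline of this kind, the entities and relations inside such a subgraph would be the raw material for a multi-hop question, since answering it requires chaining more than one edge.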

Findings

01 Outperforms traditional synthetic data methods on knowledge-intensive tasks

02 Effectively targets long-tail and complex relational knowledge

03 Improves model calibration and knowledge coverage

Abstract

Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood…
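The abstract names the expected calibration error (ECE) as the signal for finding knowledge gaps. As a rough illustration, the sketch below computes the standard equal-width binned ECE: the weighted average, over confidence bins, of the gap between mean confidence and accuracy. The function name, the bin count, and the sample values are assumptions; how GraphGen obtains per-statement confidences and applies the metric is not stated in this excerpt.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: sum over bins of (|B| / n) * |acc(B) - conf(B)|.

    confidences: model confidence in [0, 1] for each prediction.
    correct: booleans saying whether each prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical confidences and correctness labels, for illustration only.
sample_ece = expected_calibration_error(
    [0.95, 0.95, 0.65, 0.65], [True, True, True, False]
)
```

A large gap in a given bin means the model is over- or under-confident there; knowledge on which the model is poorly calibrated is a plausible candidate for targeted QA generation.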

Code & Models

Repositories

open-sciencelab/graphgen (official)

Datasets

chenzihong/GraphGen-Data
dataset · 18 downloads

Taxonomy

Topics: Software Testing and Debugging Techniques