LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points

Xuemiao Zhang; Can Ren; Chengying Tu; Rongxiang Weng; Hongfei Yan; Jingang Wang; and Xunliang Cai

arXiv:2508.01317 · cs.CL · August 7, 2025

TL;DR

This paper introduces LinkSyn, a knowledge point (KP) graph-based framework for synthesizing diverse question-answering (QA) data. The authors use it to build LinkQA, a 50B-token dataset, and report that pre-training on it improves Llama-3 8B performance by 11.51%.

Contribution

The paper presents LinkSyn, a knowledge point (KP) graph-based synthesis method that gives flexible control over discipline and difficulty distributions while balancing KP coverage and popularity, and demonstrates its effectiveness by building a large QA dataset.

Findings

01. LinkSyn is used to synthesize LinkQA, a QA dataset of 50B tokens.

02. Pre-training with LinkQA improves Llama-3 8B performance by 11.51%.

03. Llama-3 8B pre-trained with LinkQA achieves new state-of-the-art results on MMLU and CMMLU.

Abstract

The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given…
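To make the graph-walk idea in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the knowledge distribution value function, so the weighting used here (edge co-occurrence count divided by a visit-count penalty) is only an assumed stand-in for "balancing KP coverage and popularity". The function names `build_kp_graph`, `walk`, and `seeds_along_path`, the `alpha` parameter, and the toy seeds are all hypothetical. The synthesis step via DeepSeek-R1 is not shown.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_kp_graph(seeds):
    """Build a KP co-occurrence graph from (seed_id, [kp, ...]) pairs.

    Returns (edges, kp_to_seeds), where edges[a][b] counts the seeds that
    mention both a and b, and kp_to_seeds[kp] is the set of seed ids for kp.
    """
    edges = defaultdict(lambda: defaultdict(int))
    kp_to_seeds = defaultdict(set)
    for seed_id, kps in seeds:
        unique = sorted(set(kps))
        for kp in unique:
            kp_to_seeds[kp].add(seed_id)
        for a, b in combinations(unique, 2):
            edges[a][b] += 1
            edges[b][a] += 1
    return edges, kp_to_seeds


def walk(edges, start, length, visits, rng, alpha=1.0):
    """Sample a path of up to `length` KPs starting at `start`.

    Assumed weighting (not from the paper): popular edges are favored via
    their co-occurrence count, while KPs that were already visited often
    are penalized by (1 + visits) ** alpha to encourage coverage.
    """
    path = [start]
    visits[start] += 1
    current = start
    for _ in range(length - 1):
        neighbors = sorted(edges[current])
        if not neighbors:
            break  # isolated KP: the walk cannot continue
        weights = [
            edges[current][n] / (1 + visits[n]) ** alpha for n in neighbors
        ]
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        visits[current] += 1
        path.append(current)
    return path


def seeds_along_path(path, kp_to_seeds):
    """Collect seeds that link each consecutive KP pair, in path order."""
    ordered = []
    for a, b in zip(path, path[1:]):
        for seed_id in sorted(kp_to_seeds[a] & kp_to_seeds[b]):
            if seed_id not in ordered:
                ordered.append(seed_id)
    return ordered


if __name__ == "__main__":
    # Toy seeds with hand-written KP tags (hypothetical data).
    toy_seeds = [
        ("q1", ["derivative", "chain rule"]),
        ("q2", ["chain rule", "integration"]),
        ("q3", ["integration", "area"]),
    ]
    edges, kp_to_seeds = build_kp_graph(toy_seeds)
    visits = defaultdict(int)
    path = walk(edges, "derivative", 4, visits, random.Random(0))
    print(path)
    print(seeds_along_path(path, kp_to_seeds))
```

In this sketch, the seeds returned by `seeds_along_path` would be the multiple, KP-linked inputs handed to a generator model. Sharing `visits` across many walks is what shifts later walks toward less-covered KPs; raising `alpha` strengthens that effect at the expense of following popular edges.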
