LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points
Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Hongfei Yan, Jingang Wang, and Xunliang Cai

TL;DR
This paper introduces LinkSyn, a knowledge point (KP) graph-based framework for synthesizing diverse question-answering (QA) data, and uses it to build LinkQA, a large, high-quality dataset that significantly improves large language models' performance when used for pre-training.
Contribution
The paper presents LinkSyn, a novel knowledge point (KP) graph-based synthesis method that gives flexible control over data diversity, difficulty, and coverage, and demonstrates its effectiveness by using it to create a large QA dataset.
Findings
LinkSyn is used to synthesize LinkQA, a QA dataset of 50B tokens.
Pre-training with LinkQA improves Llama-3 8B performance by 11.51%.
The resulting models achieve new state-of-the-art results on MMLU and CMMLU.
Abstract
The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given…
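The graph-walk idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the co-occurrence graph, the function names `build_kp_graph` and `walk`, the `alpha` parameter, and the weighting rule (edge count divided by a visit penalty) are all assumptions made for illustration, standing in for the paper's knowledge distribution value function, which the abstract does not specify.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_kp_graph(seeds):
    """Build a KP co-occurrence graph from (seed_id, [kp, ...]) pairs.

    Hypothetical helper: two KPs are linked when they appear in the same
    seed, and the edge weight counts how many seeds they share.
    """
    graph = defaultdict(lambda: defaultdict(int))
    kp_to_seeds = defaultdict(set)
    for seed_id, kps in seeds:
        for kp in kps:
            kp_to_seeds[kp].add(seed_id)
        for a, b in combinations(sorted(set(kps)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph, kp_to_seeds


def walk(graph, start, length, visits, rng, alpha=1.0):
    """Sample a KP path of at most `length` nodes.

    Assumed weighting: popular edges (high co-occurrence) are favored,
    while KPs already visited often are penalized, so repeated walks
    spread over less-covered KPs.
    """
    path = [start]
    visits[start] += 1
    current = start
    while len(path) < length:
        neighbors = [n for n in sorted(graph[current]) if n not in path]
        if not neighbors:
            break
        weights = [graph[current][n] / (1 + visits[n]) ** alpha
                   for n in neighbors]
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        path.append(current)
        visits[current] += 1
    return path


# Toy seed data (made up for this example).
seeds = [
    ("q1", ["derivative", "chain rule"]),
    ("q2", ["chain rule", "composite function"]),
    ("q3", ["composite function", "inverse function"]),
]
graph, kp_to_seeds = build_kp_graph(seeds)
visits = defaultdict(int)
path = walk(graph, "derivative", 3, visits, random.Random(0))

# Seeds attached to the KPs on the path would be handed to the
# synthesis model as linked context.
linked_seeds = sorted(set().union(*(kp_to_seeds[kp] for kp in path)))
print(path)
print(linked_seeds)
```

Sharing one `visits` counter across many walks is what lets the penalty trade popularity against coverage in this sketch; raising `alpha` pushes sampling toward rarely visited KPs.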
