From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

Chiwei Zhu; Benfeng Xu; Xiaorui Wang; Zhendong Mao

arXiv:2506.03968 · cs.CL · June 5, 2025

TL;DR

This paper presents a method for synthesizing large-scale, diverse, and complex user instructions through attributed grounding, which ties each synthesized instruction to a real-world context. The resulting dataset improves language model performance.

Contribution

It introduces an attributed grounding framework that generates diverse instructions from web documents, enabling scalable synthesis of meaningful instructions for model training.

Findings

01 Constructed SynthQuestions, a dataset of 1 million instructions.

02 Models trained on SynthQuestions achieve strong results on existing benchmarks.

03 Performance continues to improve as more web data is used in synthesis.

Abstract

The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest…
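
The two processes named in the abstract can be pictured as a small pipeline. The following is a minimal sketch, not the authors' implementation: the `LLM` callable, the `Grounded` record, the prompt wording, and the reuse of attributed examples as few-shot demonstrations are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Grounded:
    situation: str    # the situated user: who is asking, and in what context
    instruction: str  # the instruction that user would issue


def attribute(real_instructions: List[str], llm: LLM) -> List[Grounded]:
    """Top-down step: attribute each real instruction to a situated user."""
    grounded = []
    for instruction in real_instructions:
        situation = llm(
            "Describe the user and situation behind this instruction:\n"
            + instruction
        )
        grounded.append(Grounded(situation=situation, instruction=instruction))
    return grounded


def synthesize(document: str, demos: List[Grounded], llm: LLM) -> Grounded:
    """Bottom-up step: web document -> situation -> instruction."""
    # Assumption: attributed examples serve as few-shot demonstrations.
    shots = "\n\n".join(
        f"Situation: {d.situation}\nInstruction: {d.instruction}" for d in demos
    )
    # First generate a situation grounded in the document...
    situation = llm(
        f"{shots}\n\nDocument:\n{document}\n\n"
        "Write a situation grounded in this document:"
    )
    # ...then generate the instruction a user in that situation would give.
    instruction = llm(
        f"{shots}\n\nSituation: {situation}\n\n"
        "Write the instruction this user would give:"
    )
    return Grounded(situation=situation, instruction=instruction)
```

Keeping the situation as an explicit intermediate value mirrors the "first generate a situation, then a meaningful instruction" ordering in the abstract; any real prompt templates, filtering, or scoring would replace the placeholder strings used here.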

Code & Models

Datasets

IgnoraZ/SynthQuestions
dataset · 64 downloads

Videos

From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification