WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale
Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, Furu Wei

TL;DR
WildLong introduces a scalable method for synthesizing diverse, realistic long-context instruction datasets from real user queries, significantly improving LLMs' ability to handle complex, multi-document reasoning tasks.
Contribution
It presents a novel data synthesis approach that extracts meta-information and models relationships to generate large-scale, diverse long-context instruction data, surpassing existing methods.
Findings
Models trained on WildLong data outperform existing long-context models on benchmarks.
WildLong data enhances LLMs' performance on complex, multi-document reasoning tasks.
The approach maintains strong performance on short-context tasks without additional short data.
Abstract
Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data synthesis methods focus narrowly on objectives like fact retrieval and summarization, restricting their generalizability to complex, real-world tasks. WildLong extracts meta-information from real user queries, models co-occurrence relationships via graph-based methods, and employs adaptive generation to produce scalable data. It extends beyond single-document tasks to support multi-document reasoning, such as cross-document comparison and aggregation. Our models, finetuned on 150K instruction-response pairs synthesized using WildLong, surpass existing open-source long-context-optimized models across benchmarks while maintaining strong performance on…
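The graph-based step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names (`document_type`, `task`, `user_intent`), the example records, and the helper functions `build_cooccurrence_graph` and `sample_path` are all hypothetical, and weighting each walk step by co-occurrence count is an assumption. The sketch builds a co-occurrence graph over (field, value) pairs extracted from queries, then samples a path through it so that each path combines values that actually appeared together.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(records):
    """Count how often two (field, value) pairs appear in the same record.

    `records` is a list of dicts mapping a meta-information field name to
    its value. Field names here are hypothetical placeholders.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for record in records:
        nodes = sorted(record.items())
        for a, b in combinations(nodes, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def sample_path(graph, start, length, rng):
    """Weighted random walk that visits each field at most once.

    Assumption: the next node is drawn with probability proportional to
    its co-occurrence count with the current node.
    """
    path = [start]
    visited_fields = {start[0]}
    current = start
    while len(path) < length:
        candidates = [
            (node, weight)
            for node, weight in graph.get(current, {}).items()
            if node[0] not in visited_fields
        ]
        if not candidates:
            break  # no unvisited field is reachable from here
        nodes, weights = zip(*candidates)
        current = rng.choices(nodes, weights=weights, k=1)[0]
        path.append(current)
        visited_fields.add(current[0])
    return path


# Hypothetical meta-information extracted from three user queries.
records = [
    {"document_type": "legal contract", "task": "cross-document comparison",
     "user_intent": "find conflicting clauses"},
    {"document_type": "legal contract", "task": "summarization",
     "user_intent": "brief a client"},
    {"document_type": "research paper", "task": "cross-document comparison",
     "user_intent": "compare methods"},
]

graph = build_cooccurrence_graph(records)
path = sample_path(graph, ("document_type", "legal contract"), 3, random.Random(0))
print(path)
```

A sampled path such as this one could then be turned into a prompt that asks a teacher model to write an instruction matching the combination. That generation step is omitted here.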
Peer Reviews
Decision·Submitted to ICLR 2026
- The graph-based meta-information modeling and path sampling strategy enable scalable and diverse instruction generation, grounded in real user interactions.
- The paper includes extensive experiments across multiple long-context benchmarks (RULER, HELMET, LongBench-Chat) and compares against a wide range of proprietary, open-source, and specialized long-context models.
- The authors thoroughly ablate key components (e.g., path sampling strategy, path length, teacher model impact) and demonstra
- While the framework supports multi-document tasks, the evaluation does not clearly distinguish whether the gains come from single- or multi-document supervision. A more fine-grained breakdown of multi-document task performance (e.g., cross-document reasoning, synthesis) would better validate the extension.
- The use of GPT-4 as the primary teacher model for instruction–response generation raises concerns about reproducibility and accessibility.
- The paper does not include a qualitative analy
First, the paper combines graph-based modeling with meta-information extraction and uses random walks to generate diverse long-context tasks. This data synthesis idea is both innovative and scalable. Second, it validates performance improvements across multiple well-known benchmarks such as RULER, HELMET, and LongBench, with thorough comparisons against various baselines, including models specifically optimized for long contexts. Third, by extracting meta-information from real dialogue data li
First, the data generation process heavily relies on closed-source models such as GPT-4. Although a Qwen-based experiment is included, reproducibility and sustainability remain limited. Second, the analysis focuses mainly on quantitative improvements but lacks a deeper look into instruction diversity, semantic authenticity, and interpretability. Third, even though the dataset is large, there is no demonstration of its effectiveness or transferability in real downstream applications such as leg
- Novel and Scalable Data Generation: The core strength of the paper is its innovative, graph-based approach to data synthesis. By modeling the co-occurrence of meta-information from real user queries, WildLong can generate a massive and diverse set of instructions that are more realistic and complex than those from previous methods.
- Strong Empirical Results: The fine-tuned models show significant performance gains on multiple long-context benchmarks. The Llama-3.1-8B model trained with WildLo
- Limited Scope of "Real-World" Scenarios: The initial meta-information is extracted from the WildChat dataset, which consists of user-ChatGPT conversations. While large, this outdated dataset (before 2024.5) may not capture the full spectrum of long-context reasoning in specialized domains like agentic tasks and coding tasks.
- Potential for Inherited Bias: The framework synthesizes data based on patterns in existing conversations (WildChat) and generates responses using an LLM (GPT-4). Any bia
Taxonomy
Topics: Online Learning and Analytics
Methods: Focus
