TL;DR
LongMagpie is a self-synthesis framework that automatically generates large-scale, high-quality long-context instruction data for language models, reducing reliance on human annotation and template-based methods.
Contribution
It introduces a novel self-synthesis approach that leverages aligned LLMs to generate diverse long-context instruction data without human effort.
Findings
Achieves leading performance on long-context tasks
Maintains competitive performance on short-context tasks
Demonstrates effectiveness across multiple benchmarks
Abstract
High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
