LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

Chaochen Gao; Xing Wu; Zijia Lin; Debing Zhang; Songlin Hu

arXiv:2505.17134·cs.CL·June 4, 2025

LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu

PDF

1 Video

TL;DR

LongMagpie is a self-synthesis framework that automatically generates large-scale, high-quality long-context instruction data for language models, reducing reliance on human annotation and template-based methods.

Contribution

It introduces a novel self-synthesis approach that leverages aligned LLMs to generate diverse long-context instruction data without human effort.

Findings

01

Achieves leading performance on long-context tasks

02

Maintains competitive performance on short-context tasks

03

Demonstrates effectiveness across multiple benchmarks

Abstract

High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions· slideslive