Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model
Chaochen Gao, Xing Wu, Qi Fu, Songlin Hu

TL;DR
Quest is a novel query-centric data synthesis method that enhances long-context scaling of large language models by balancing semantic relevance and diversity, leading to improved performance on complex tasks with very long input sequences.
Contribution
The paper introduces Quest, a new data synthesis approach that groups semantically relevant documents based on predicted queries, addressing limitations of previous methods in long-context training.
Findings
Outperforms existing methods on long-context tasks
Effective with context lengths up to 1 million tokens
Scalable across various model sizes
Abstract
Recent advancements in large language models (LLMs) have highlighted the importance of extending context lengths for handling complex tasks. While traditional methods for training on long contexts often use filtered long documents, these approaches lead to domain imbalances, limiting model performance. To address this, techniques like random document concatenation (Standard) and similarity-based methods (KNN, ICLM) have been developed. However, they either sacrifice semantic coherence or diversity. To balance both aspects, we introduce Quest, a query-centric data synthesis method aggregating semantically relevant yet diverse documents. Quest uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords. Extensive experiments demonstrate Quest's superior performance on long-context tasks, achieving remarkable results with…
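The grouping step described in the abstract can be illustrated with a minimal sketch. The abstract says only that queries are predicted by a generative model and that documents with similar queries and keywords are grouped. Everything else here is an assumption for illustration, not the authors' implementation: the queries are passed in as plain strings, `keyword_of` is a hypothetical stand-in that picks the longest non-stopword token, and `build_long_samples` packs each group up to a whitespace-token budget.

```python
import random
from collections import defaultdict

# Hypothetical stopword list, used only by the stand-in keyword extractor.
STOPWORDS = {"what", "is", "the", "how", "to", "a", "an", "of", "in", "for", "does", "do"}


def keyword_of(query):
    """Stand-in keyword extractor: longest non-stopword token (first one on ties)."""
    tokens = [t.strip("?.,!").lower() for t in query.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return max(tokens, key=len) if tokens else ""


def group_by_keyword(docs, predicted_queries):
    """Index documents by the keyword of their (already predicted) query."""
    groups = defaultdict(list)
    for doc, query in zip(docs, predicted_queries):
        groups[keyword_of(query)].append(doc)
    return dict(groups)


def build_long_samples(groups, max_tokens, seed=0):
    """Shuffle each group and concatenate its documents up to max_tokens words."""
    rng = random.Random(seed)
    samples = []
    for keyword in sorted(groups):
        docs = list(groups[keyword])
        rng.shuffle(docs)  # random order within a group keeps samples varied
        current, length = [], 0
        for doc in docs:
            n = len(doc.split())  # whitespace tokens as a crude length proxy
            if current and length + n > max_tokens:
                samples.append(" ".join(current))
                current, length = [], 0
            current.append(doc)
            length += n
        if current:
            samples.append(" ".join(current))
    return samples


docs = [
    "Transformers use attention layers.",
    "Attention weights are softmax outputs.",
    "Sourdough needs a starter.",
]
# In the paper these would come from a generative model; here they are hard-coded.
queries = ["How does attention work?", "What is attention?", "How to bake sourdough?"]

groups = group_by_keyword(docs, queries)
samples = build_long_samples(groups, max_tokens=100)
```

In this toy run the two attention documents share a keyword and end up in one training sample, while the unrelated document forms its own. A real pipeline would replace the hard-coded queries with model predictions and the word count with a proper tokenizer.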
Taxonomy
Topics: Topic Modeling · Data Mining Algorithms and Applications · Data Quality and Management
