Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model

Chaochen Gao; Xing Wu; Qi Fu; Songlin Hu

arXiv:2405.19846 · cs.CL · February 12, 2025

TL;DR

Quest is a novel query-centric data synthesis method that enhances long-context scaling of large language models by balancing semantic relevance and diversity, leading to improved performance on complex tasks with very long input sequences.

Contribution

The paper introduces Quest, a new data synthesis approach that groups semantically relevant documents based on predicted queries, addressing limitations of previous methods in long-context training.

Findings

01. Outperforms existing methods on long-context tasks
02. Effective with context lengths up to 1 million tokens
03. Scalable across various model sizes

Abstract

Recent advancements in large language models (LLMs) have highlighted the importance of extending context lengths for handling complex tasks. While traditional methods for training on long contexts often use filtered long documents, these approaches lead to domain imbalances, limiting model performance. To address this, techniques like random document concatenation (Standard) and similarity-based methods (KNN, ICLM) have been developed. However, they either sacrifice semantic coherence or diversity. To balance both aspects, we introduce Quest, a query-centric data synthesis method aggregating semantically relevant yet diverse documents. Quest uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords. Extensive experiments demonstrate Quest's superior performance on long-context tasks, achieving remarkable results with…
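The grouping idea described in the abstract (predict a query per document, then aggregate documents whose queries share a keyword) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the names `predict_query`, `top_keyword`, `group_by_keyword`, and `build_contexts` are hypothetical, the first-sentence heuristic is a stand-in for the generative query model, and the term-frequency keyword choice is a stand-in for a real keyword extractor.

```python
import random
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption; a real pipeline would use a fuller one).
STOPWORDS = {"a", "an", "the", "of", "on", "in", "to", "and", "is", "them"}


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def predict_query(document: str) -> str:
    """Hypothetical stand-in for a generative query predictor.

    Here the 'query' is simply the document's first sentence.
    """
    return re.split(r"(?<=[.!?])\s+", document.strip())[0]


def top_keyword(query: str, document: str) -> str:
    """Pick the query word that occurs most often in the document.

    Ties go to the word that appears first in the query.
    """
    doc_counts = Counter(tokenize(document))
    candidates = [w for w in tokenize(query) if w not in STOPWORDS]
    if not candidates:
        return ""
    return max(candidates, key=lambda w: doc_counts[w])


def group_by_keyword(documents: list[str]) -> dict[str, list[str]]:
    """Build an inverted index: keyword -> documents whose query yields it."""
    index: dict[str, list[str]] = defaultdict(list)
    for doc in documents:
        index[top_keyword(predict_query(doc), doc)].append(doc)
    return dict(index)


def build_contexts(documents: list[str], max_words: int, seed: int = 0) -> list[str]:
    """Concatenate same-keyword documents into samples of roughly max_words words.

    Documents are shuffled within each group so that the order inside a
    sample is not fixed; a single document longer than max_words still
    becomes its own sample.
    """
    rng = random.Random(seed)
    contexts = []
    for docs in group_by_keyword(documents).values():
        docs = docs[:]
        rng.shuffle(docs)
        current, length = [], 0
        for doc in docs:
            n = len(doc.split())
            if current and length + n > max_words:
                contexts.append("\n\n".join(current))
                current, length = [], 0
            current.append(doc)
            length += n
        if current:
            contexts.append("\n\n".join(current))
    return contexts


if __name__ == "__main__":
    corpus = [
        "Python decorators wrap functions. Python calls the wrapper instead.",
        "Python generators yield values lazily. Python resumes them on demand.",
        "Sourdough bread needs a starter. Sourdough ferments slowly.",
    ]
    for sample in build_contexts(corpus, max_words=100):
        print(repr(sample))
```

In this sketch, random concatenation (the "Standard" baseline named in the abstract) would correspond to skipping `group_by_keyword` and shuffling the whole corpus, which keeps diversity but drops topical relatedness within a sample.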

Code & Models

Repositories

caskcsg/longcontext
pytorch

Videos

Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model · slideslive

Taxonomy

Topics: Topic Modeling · Data Mining Algorithms and Applications · Data Quality and Management