GATEAU: Selecting Influential Samples for Long Context Alignment

Shuzheng Si; Haozhe Zhao; Gang Chen; Yunshui Li; Kangyang Luo; Chuancheng Lv; Kaikai An; Fanchao Qi; Baobao Chang; Maosong Sun

arXiv:2410.15633·cs.CL·September 16, 2025

TL;DR

GATEAU is a framework for long context alignment of large language models. It selects influential training samples that are rich in long-range dependencies, which improves the models' instruction-following and long-context understanding.

Contribution

GATEAU introduces a sample selection method that scores training samples by their long-range dependencies, addressing data quality issues in synthesized long-context instruction-following datasets.

Findings

01. Improves model performance on long-context tasks

02. Effectively identifies influential samples with long-range dependencies

03. Enhances instruction-following and comprehension abilities

Abstract

Aligning large language models to handle instructions with extremely long contexts has yet to be fully investigated. Previous studies have attempted to scale up the available data volume by synthesizing long instruction-following samples, as constructing such a dataset tends to be challenging for annotators. However, a lack of a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the model's performance. Thus, we propose GATEAU, a novel framework to address the unique challenge of long context alignment by identifying the influential samples enriched with long-range dependency relations. Specifically, GATEAU measures the long-range dependencies from two essential aspects: the difficulty of generating target responses due to the long-range dependencies, and the difficulty of understanding long inputs due to such dependencies. Comprehensive…
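
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each sample already has two precomputed scores, one for the difficulty of generating the target response and one for the difficulty of understanding the long input. How those scores are computed is not shown in the text above, so they are plain inputs here. The function names, the min-max normalization, the weight `alpha`, and the `keep_ratio` cutoff are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant or empty inputs map to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_influential(
    generation_difficulty: Sequence[float],
    understanding_difficulty: Sequence[float],
    alpha: float = 0.5,       # hypothetical weight between the two aspects
    keep_ratio: float = 0.3,  # hypothetical fraction of samples to keep
) -> List[int]:
    """Return indices of the highest-scoring samples, best first."""
    if len(generation_difficulty) != len(understanding_difficulty):
        raise ValueError("score lists must have the same length")
    gen = min_max(generation_difficulty)
    und = min_max(understanding_difficulty)
    # Weighted sum of the two normalized difficulty scores (an assumption).
    combined = [alpha * g + (1.0 - alpha) * u for g, u in zip(gen, und)]
    # Keep at least one sample when the input is non-empty.
    k = max(1, round(keep_ratio * len(combined)))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:k]
```

Calling `select_influential(gen_scores, und_scores, keep_ratio=0.5)` returns the indices of the top half of the samples by combined score. Those indices would then pick the subset of the long instruction-following data used for alignment.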

Code & Models

Repositories

s1s-z/gateau (jax, official)

Videos

GATEAU: Selecting Influential Samples for Long Context Alignment

Taxonomy

Topics: Semantic Web and Ontologies · Data Management and Algorithms · Geographic Information Systems Studies

Methods: Class-activation map