TL;DR
This paper introduces LongFilter, a data curation framework that enhances long-context language model pretraining by selecting samples with meaningful long-range dependencies, leading to improved performance on various benchmarks.
Contribution
The paper presents LongFilter, a novel method for filtering training data based on long-range information gain, optimizing long-context model pretraining efficiency and effectiveness.
Findings
LongFilter improves data quality for long-context training.
Enhanced models show better performance on long-range reasoning benchmarks.
Efficient data selection reduces training on irrelevant samples.
Abstract
Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields…
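The contrast the abstract describes can be illustrated with a toy sketch. It is not the authors' implementation: the exact scoring formula is not given in this summary, so the per-token KL divergence (suggested by the review comments), the averaging over positions, and the function names `kl_divergence`, `long_context_gain`, and `select_top_fraction` are all assumptions made for illustration. The inputs stand in for next-token distributions that a scoring model would produce for the same positions under a long and a short context window.

```python
import math
from typing import List, Sequence

Dist = Sequence[float]  # one next-token probability distribution


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def long_context_gain(long_dists: List[Dist], short_dists: List[Dist]) -> float:
    """Hypothetical sample score: mean per-token KL between the
    long-context and short-context predictions (an assumption)."""
    if len(long_dists) != len(short_dists):
        raise ValueError("both settings must score the same token positions")
    if not long_dists:
        return 0.0
    total = sum(kl_divergence(p, q) for p, q in zip(long_dists, short_dists))
    return total / len(long_dists)


def select_top_fraction(scores: List[float], fraction: float) -> List[int]:
    """Return indices of the highest-scoring samples, in original order."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: two-token vocabulary, two scored positions per sample.
local_only = long_context_gain(
    [[0.5, 0.5], [0.7, 0.3]],  # long-context predictions
    [[0.5, 0.5], [0.7, 0.3]],  # short-context predictions (identical)
)
needs_long = long_context_gain(
    [[0.9, 0.1], [0.9, 0.1]],  # long context makes the model confident
    [[0.5, 0.5], [0.5, 0.5]],  # short context leaves it uncertain
)
kept = select_top_fraction([local_only, needs_long], fraction=0.5)
print(local_only, needs_long, kept)
```

In this toy setup, a sample whose predictions do not change when extra context is added scores zero and is filtered out, while a sample whose predictions sharpen under the long context is kept.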
Peer Reviews
Decision: ICLR 2026 Poster
* Novel and principled metric: The use of conditional mutual information (via KL divergence) to quantify long-range dependency is theoretically grounded and practically effective.
* Strong empirical validation: Experiments across multiple benchmarks and data domains (ArXiv, Books, CommonCrawl) demonstrate robustness. LongFilter outperforms length-based baselines (ProLong) and a recent alternative (LongWanjuan).
* High practical impact: Achieves comparable performance with half the training tok
* Computational cost: Scoring requires forward passes with both short and long contexts using a large model (Llama-3.1-8B), which may be prohibitive for very large corpora despite optimizations.
* Limited model scope: Evaluation is restricted to LLaMA-3-8B; generalizability to other architectures (e.g., Mamba, RWKV) or smaller models is unverified.
* Task coverage: Benchmarks focus on retrieval and structured reasoning; performance on narrative coherence or open-ended generation is not assesse
The paper shows that sequence length alone is an insufficient proxy for data quality. Much long-text data does not genuinely require long-range dependencies, which can dilute the training signal. The proposed method, LongFilter, quantifies the informational value of extended context, which is a significant contribution to LLM pretraining. The experiments are well-founded: they use Llama-3.1-8B, extend it to a significant context length (8K to 64K), and train on a large-sc
The paper highlights the training efficiency gains, but the data curation step itself appears to have a high computational cost. The authors report that scoring each corpus required 32 NVIDIA H100 GPUs for a single day. The paper uses the Llama-3.1-8B model as the scoring model to conduct experiments and achieve favorable results. However, the paper does not test other models for generating the scores, which suggests a lack of generalizability regarding the choice of the scoring model in the ex
1. This work addresses an important but often overlooked issue in scaling up LLMs’ context windows, i.e., not all “long” data is useful for learning true long-range dependencies.
2. The authors propose a clear and interpretable metric (information gain) for identifying valuable training samples.
3. The paper provides an additional analysis on what kinds of text benefit most from extended contexts, informing future research and practical data selection.
1. All results and analyses are based on a single model (i.e., LLaMA-3-8B). Showing the method's effectiveness on more backbone models would strengthen the paper.
2. Beyond long-context understanding tasks, results on long-generation tasks, such as long reasoning and long writing, would further demonstrate the improvement in processing long-range information.
