LongAttn: Selecting Long-context Training Data via Token-level Attention

Longyun Wu; Dawei Zhu; Guangxiang Zhao; Zhuocheng Yu; Junfeng Ran; Xiangyu Wong; Lin Sun; Sujian Li

arXiv:2502.16860·cs.CL·February 28, 2025

TL;DR

LongAttn is a token-level framework that uses an LLM's self-attention to measure long-range dependencies in candidate documents, with the aim of making long-context training data selection more accurate and efficient.

Contribution

The paper presents a token-level approach that uses self-attention to quantify long-range dependencies for data selection, and reports that it outperforms sentence-level methods.

Findings

01 Effective long-range dependency measurement

02 Improved data selection efficiency

03 High-quality long-context dataset released

Abstract

With the development of large language models (LLMs), there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with long-range dependencies is crucial. Existing methods to select long-context data often rely on sentence-level analysis, which can be greatly optimized in both performance and efficiency. In this paper, we propose a novel token-level framework, LongAttn, which leverages the self-attention mechanism of LLMs to measure the long-range dependencies for the data. By calculating token-level dependency strength and distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Through our…
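The two quantities named in the abstract, token-level dependency strength and the uniformity of token scores, can be illustrated with a small sketch. This is not the authors' formulation: the function names, the `min_distance` threshold, the use of a mean for strength, and the use of normalized entropy for uniformity are all assumptions chosen for illustration. The input is assumed to be a causal attention matrix in which row q holds the weights that query token q places on keys 0..q.

```python
import math


def long_range_scores(attn, min_distance):
    """Per-token score: attention mass on keys at least `min_distance` back.

    Hypothetical helper; the threshold-based definition is an assumption.
    """
    scores = []
    for q, row in enumerate(attn):
        # Under a causal mask only keys 0..q are visible to query q.
        far = sum(w for k, w in enumerate(row[: q + 1]) if q - k >= min_distance)
        scores.append(far)
    return scores


def dependency_strength(scores):
    """Assumed aggregate: mean long-range attention mass over all tokens."""
    return sum(scores) / len(scores)


def uniformity(scores):
    """Assumed uniformity: entropy of the normalized scores divided by log(n).

    Equals 1.0 when every token has the same score and approaches 0.0
    when the long-range mass is concentrated on a single token.
    """
    total = sum(scores)
    if total == 0 or len(scores) < 2:
        return 0.0
    probs = [s / total for s in scores if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(scores))


# Toy 4-token example with uniform causal attention (illustrative data only).
toy_attn = [
    [1.0],
    [0.5, 0.5],
    [1 / 3, 1 / 3, 1 / 3],
    [0.25, 0.25, 0.25, 0.25],
]
toy_scores = long_range_scores(toy_attn, min_distance=2)
toy_strength = dependency_strength(toy_scores)
toy_uniformity = uniformity(toy_scores)
```

Under these assumptions, a document would rank higher for selection when its strength is high and its long-range attention is spread across many tokens rather than concentrated on a few; how the paper actually combines the two signals is not stated in the excerpt above.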

Code & Models

Repositories

Lyun0912-wu/LongAttn (PyTorch, official)

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Machine Learning in Healthcare