AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

Kai Hua; Steven Wu; Ge Zhang; Ke Shen

arXiv:2505.07293 · cs.CL · May 13, 2025


Open Access · 3 Reviews

TL;DR

This paper introduces AttentionInfluence, a training-free, attention head masking method that enables small models to select high-quality reasoning data, significantly enhancing large model performance on reasoning benchmarks.

Contribution

It proposes a novel, supervision-free data selection method using attention head influence, improving reasoning data quality for pretraining large language models.

Findings

01. Significant performance improvements on reasoning benchmarks.

02. Effective data selection without supervision or human labeling.

03. Scalable approach for weak-to-strong model training.

Abstract

Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs' complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs and often introduces domain-specific biases. Because attention heads are crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method that needs no supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising…
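The scoring step described in the abstract (compare a document's loss with and without the retrieval heads masked, then keep the documents most affected) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the per-token losses are assumed to come from two forward passes of the small model (one unmodified, one with the identified heads masked), and normalizing the difference by the base loss is an assumption, since the abstract only mentions a "loss difference".

```python
from typing import List, Sequence


def mean_loss(token_losses: Sequence[float]) -> float:
    """Average per-token cross-entropy for one document."""
    if not token_losses:
        raise ValueError("document has no tokens")
    return sum(token_losses) / len(token_losses)


def influence_score(base_losses: Sequence[float],
                    masked_losses: Sequence[float]) -> float:
    """Relative loss increase caused by masking the selected heads.

    Assumption: the difference is divided by the base loss so that
    documents of different difficulty are comparable.
    """
    base = mean_loss(base_losses)
    masked = mean_loss(masked_losses)
    if base <= 0.0:
        raise ValueError("base loss must be positive")
    return (masked - base) / base


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the highest-scoring documents, in corpus order."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage with made-up per-token losses for three documents.
base = [[2.0, 2.0], [1.0, 1.0], [4.0, 4.0]]
masked = [[3.0, 3.0], [1.0, 1.0], [5.0, 5.0]]
scores = [influence_score(b, m) for b, m in zip(base, masked)]
selected = select_top_fraction(scores, fraction=0.34)
```

A document whose loss rises sharply once the heads are masked is taken to depend more on those heads, which is the signal used for selection; the selection fraction here is arbitrary.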

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper proposes a new pre-training data selection method with a focus on the efficiency of data selection and weak-to-strong generalization. Such new perspectives on pre-training data selection contribute to the literature beyond language modeling.
2. The proposed method is well grounded in the interpretability literature, and experiments across multiple benchmarks provide empirical support.
3. The paper presents comprehensive analyses of different design choices associated with the prop

Weaknesses

1. There exists a mismatch between the functionality of retrieval heads (long-context retrieval and reasoning) and the downstream task of the paper (pre-training data selection), and this leads to my concern about whether the proposed method is appropriate and well-motivated. In the literature, the retrieval heads are shown to be important for long-context retrieval, understanding, and reasoning tasks (e.g., needle-in-the-haystack), but their influences on short-context tasks are much less stron

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper introduces a new perspective by leveraging mechanistic interpretability (retrieval head behavior) for pretraining data selection.
2. It provides detailed ablations and qualitative analyses.
3. The method is effective as demonstrated by the pretraining experiments while being entirely training-free and unsupervised.

Weaknesses

Since only one pretraining corpus (SmolLM) and one pretraining model (a 7B model) are used, the robustness and generalizability of the method may be limited. Considering the high cost of pretraining and the theoretical generality of the AttentionInfluence method, it should be possible to further verify its effectiveness through post-training experiments.

Reviewer 03 · Rating 2 · Confidence 3

Strengths

* The design of the proposed metric (AttentionInfluence Score) is convincing for data selection.
* The proposed data selection process outperforms relevant baselines.

Weaknesses

* The samples used to identify the important attention heads are very important for the later data selection, as the data instances for pre-training are selected mostly from their signals. In Section 4.1, the authors mention that it is derived from 800 synthetic samples, and it is questionable whether the selected data instances are just very similar to those synthetic samples. Also, more details on constructing those samples and their quality should be provided. Lastly, it would be great if the

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need