AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
Kai Hua, Steven Wu, Ge Zhang, Ke Shen

TL;DR
This paper introduces AttentionInfluence, a training-free method based on attention head masking that lets a small model select high-quality reasoning data, significantly improving the performance of larger models on reasoning benchmarks.
Contribution
It proposes a novel, supervision-free data selection method using attention head influence, improving reasoning data quality for pretraining large language models.
Findings
Significant performance improvements on reasoning benchmarks.
Effective data selection without supervision or human labeling.
Scalable approach for weak-to-strong model training.
Abstract
Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs' complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs and often introduces domain-specific biases. Because attention heads are crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method that requires no supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising…
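The scoring step described in the abstract (the loss difference between the model with its retrieval heads masked and the unmasked model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `influence_scores` and `select_top_fraction`, the use of a plain difference with no normalization, the "keep the top fraction" rule, and the toy loss values are all assumptions made for this example. Computing the two per-sample losses requires running the small model twice and is outside the scope of the sketch.

```python
from typing import List, Sequence


def influence_scores(
    base_losses: Sequence[float],
    masked_losses: Sequence[float],
) -> List[float]:
    """Per-sample loss difference: masked-model loss minus base-model loss.

    A larger value means the sample became harder to predict once the
    selected heads were masked (assumed here to signal reliance on them).
    """
    if len(base_losses) != len(masked_losses):
        raise ValueError("loss sequences must have the same length")
    return [masked - base for base, masked in zip(base_losses, masked_losses)]


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the highest-scoring samples, best first.

    The selection rule (keep a fixed top fraction) is an assumption
    made for illustration.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]


# Toy per-sample losses (hypothetical numbers, not from the paper).
base = [2.0, 3.0, 1.5, 2.5]      # loss under the unmodified small model
masked = [2.1, 4.5, 1.5, 3.0]    # loss with the chosen heads masked

scores = influence_scores(base, masked)
selected = select_top_fraction(scores, fraction=0.5)  # indices 1 and 3
```

In this toy run, samples 1 and 3 show the largest loss increase under masking, so they are the ones kept.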
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper proposes a new pre-training data selection method with a focus on the efficiency of data selection and weak-to-strong generalization. Such new perspectives on pre-training data selection contribute to the literature beyond language modeling.
2. The proposed method is well grounded in the interpretability literature, and experiments across multiple benchmarks provide empirical support.
3. The paper presents comprehensive analyses of different design choices associated with the prop
1. There exists a mismatch between the functionality of retrieval heads (long-context retrieval and reasoning) and the downstream task of the paper (pre-training data selection), and this leads to my concern about whether the proposed method is appropriate and well-motivated. In the literature, the retrieval heads are shown to be important for long-context retrieval, understanding, and reasoning tasks (e.g., needle-in-the-haystack), but their influences on short-context tasks are much less stron
1. The paper introduces a new perspective by leveraging mechanistic interpretability (retrieval head behavior) for pretraining data selection.
2. It provides detailed ablations and qualitative analyses.
3. The method is effective as demonstrated by the pretraining experiments while being entirely training-free and unsupervised.
Since only one pretraining corpus (SmolLM) and one pretraining model (a 7B model) are used, the robustness and generalizability of the method may be limited. Considering the high cost of pretraining and the theoretical generality of the AttentionInfluence method, it should be possible to further verify its effectiveness through post-training experiments.
* The design of the proposed metric (AttentionInfluence Score) is convincing for data selection. * The proposed data selection process outperforms relevant baselines.
* The samples used to identify the important attention heads are very important for the later data selection, as the data instances for pre-training are selected mostly from their signals. In Section 4.1, the authors mention that it is derived from 800 synthetic samples, and it is questionable whether the selected data instances are just very similar to those synthetic samples. Also, more details on constructing those samples and their quality should be provided. Lastly, it would be great if the
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
