Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

Zijun Wang; Haoqin Tu; Weidong Zhou; Yiyang Zhou; Xiaohuan Zhou; Bingni Zhang; Weiguo Feng; Taifeng Wang; Cihang Xie; Fengze Liu

arXiv:2604.15706·cs.CL·April 20, 2026

TL;DR

This paper introduces NAG-based Ranking, a target-oriented data selection method for language models that uses neuron impact analysis to improve pretraining effectiveness and interpretability.

Contribution

The authors propose a novel neuron impact-based framework for target data selection that outperforms existing methods and provides interpretability of the pretraining process.

Findings

01

NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling across six benchmarks.

02

NAG-based Ranking outperforms state-of-the-art baselines by 5.3% accuracy on HellaSwag.

03

Deactivating NAG-selected neurons causes a 23.5% performance drop, showing their importance.

Abstract

Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLM. Concretely, we quantify neuron impact, select the most influential neurons across layers into a compact Neuron-Activated Graph (NAG), and rank candidate data by NAG similarity to target examples. We conduct experiments across six benchmarks, where our NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling, and also outperforms state-of-the-art baselines by…
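
The pipeline described in the abstract (score neurons, keep the most influential ones per layer, rank candidates by overlap with the target) can be illustrated with a small sketch. The abstract does not state how neuron impact or NAG similarity is computed, so this sketch makes two assumptions that may differ from the paper: impact is approximated by absolute activation magnitude, and similarity is the Jaccard overlap of the selected (layer, neuron) pairs. The names `build_nag`, `nag_similarity`, and `rank_candidates` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def build_nag(activations, k):
    """Select the top-k neurons per layer for one input.

    activations: array of shape (num_layers, num_neurons).
    ASSUMPTION: impact is approximated by |activation|; the paper's
    actual impact measure is not given in the abstract.
    Returns a set of (layer, neuron) pairs.
    """
    nag = set()
    for layer, acts in enumerate(np.asarray(activations)):
        top = np.argsort(-np.abs(acts), kind="stable")[:k]
        nag.update((layer, int(n)) for n in top)
    return nag

def nag_similarity(a, b):
    """ASSUMPTION: Jaccard overlap between two sets of (layer, neuron) pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_candidates(target_nags, candidate_nags):
    """Score each candidate by its mean similarity to the target NAGs,
    then return candidate indices sorted from most to least similar."""
    scores = [
        float(np.mean([nag_similarity(c, t) for t in target_nags]))
        for c in candidate_nags
    ]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

# Toy data: 2 layers x 4 neurons per input (made-up numbers for illustration).
target = np.array([[0.9, 0.1, -0.8, 0.0],
                   [0.0, 0.7, 0.1, -0.6]])
cand_a = np.array([[1.0, 0.0, -0.7, 0.1],   # same dominant neurons as target
                   [0.1, 0.9, 0.0, -0.5]])
cand_b = np.array([[0.0, 0.9, 0.1, 0.8],    # different dominant neurons
                   [0.8, 0.0, 0.7, 0.1]])

target_nags = [build_nag(target, k=2)]
candidate_nags = [build_nag(cand_a, k=2), build_nag(cand_b, k=2)]
order, scores = rank_candidates(target_nags, candidate_nags)
```

In this toy setup `cand_a` shares all of its selected neurons with the target and `cand_b` shares none, so `cand_a` is ranked first. In practice the activations would come from a forward pass of an off-the-shelf LLM rather than hand-written arrays.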

Code & Models

Repositories

asillycat/NAG (GitHub)