An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

Chengli Xing; Zhengran Zeng; Gexiang Fang; Rui Xie; Wei Ye; Shikun Zhang

arXiv:2604.07769·cs.SE·April 10, 2026

TL;DR

This paper empirically evaluates influence-based data filtering for pretraining large language models on code, showing that it can improve performance and that the criteria for beneficial training data vary across tasks.

Contribution

The paper introduces a method for calculating data-influence scores for programming tasks and assesses how effectively filtering by these scores improves Code-LLMs.

Findings

01

Data-influence-score filtering improves model performance.

02

Beneficial training data criteria vary across programming tasks.

03

Validation-set-loss-based filtering is feasible and effective.

Abstract

Recent advancements in code large language models (Code-LLMs) have demonstrated remarkable capabilities in resolving programming-related tasks. Meanwhile, researchers have recognized that the quality of pre-training data is crucial for improving LLM performance. However, most existing research on pre-training data filtering has focused on general datasets, and little attention has been paid to programming datasets. In this paper, we aim to address this gap by exploring the effectiveness of a widely used general data filtering technique, i.e., data-influence-score filtering, within the context of programming-related datasets. To this end, we first introduce a method for calculating the data-influence-score for generative programming tasks, which involves transforming a variety of downstream coding tasks into validation sets and using the model's loss on these sets as a performance metric. Next, we…
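To make the idea of scoring training data by validation-set loss concrete, here is a minimal sketch. It is not the paper's procedure, which the truncated abstract does not spell out. It assumes one common formulation: a candidate example's influence is the drop in validation loss after a single gradient step on that example. The toy linear model and the names `val_loss`, `influence_score`, and `select_top_k` are hypothetical, chosen only for illustration.

```python
import numpy as np


def val_loss(w, X_val, y_val):
    """Mean squared error of a linear model w on the validation set."""
    residual = X_val @ w - y_val
    return float(np.mean(residual ** 2))


def influence_score(w, x, y, X_val, y_val, lr=0.1):
    """Validation-loss drop after one SGD step on the single example (x, y).

    Assumption: a positive score means training on (x, y) lowered the
    validation loss, so the example is treated as beneficial.
    """
    grad = 2.0 * (x @ w - y) * x  # gradient of the squared error on (x, y)
    w_after = w - lr * grad
    return val_loss(w, X_val, y_val) - val_loss(w_after, X_val, y_val)


def select_top_k(w, X_pool, y_pool, X_val, y_val, k, lr=0.1):
    """Score every candidate in the pool and keep the k highest-scoring ones."""
    scores = np.array(
        [influence_score(w, x, y, X_val, y_val, lr) for x, y in zip(X_pool, y_pool)]
    )
    keep = np.argsort(-scores)[:k]  # indices sorted by descending score
    return keep, scores


# Toy data: three candidate examples and a one-example validation set.
w0 = np.zeros(2)
X_val = np.array([[1.0, 0.0]])
y_val = np.array([1.0])
X_pool = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_pool = np.array([1.0, -1.0, 1.0])

keep, scores = select_top_k(w0, X_pool, y_pool, X_val, y_val, k=2)
```

In this toy setup the first candidate agrees with the validation target and receives a positive score, the second contradicts it and receives a negative score, and the third is orthogonal to the validation input and scores zero. For a Code-LLM the validation sets would instead be built from downstream coding tasks, as the abstract describes, and the model and loss would be those of the language model rather than this linear stand-in.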
