An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
Chengli Xing, Zhengran Zeng, Gexiang Fang, Rui Xie, Wei Ye, Shikun Zhang

TL;DR
This paper empirically evaluates influence-based data filtering for pretraining large language models on code. It shows that filtering can improve model performance and that the data that helps varies across programming tasks.
Contribution
It introduces a data-influence-score calculation method for programming tasks and assesses its effectiveness in improving Code-LLMs.
Findings
Data-influence-score filtering improves model performance.
The criteria for beneficial training data vary across programming tasks.
Filtering based on validation-set loss is feasible and effective.
Abstract
Recent advancements in code large language models (Code-LLMs) have demonstrated remarkable capabilities in resolving programming-related tasks. Meanwhile, researchers have recognized that the quality of pre-training data is crucial for improving LLM performance. However, most existing research on pre-training data filtering has focused on general datasets, and little attention has been paid to programming datasets. In this paper, we aim to address this gap by exploring the effectiveness of a widely used general data filtering technique, i.e., data-influence-score filtering, within the context of programming-related datasets. To this end, we first introduce a method for calculating the data-influence-score for generative programming tasks, which involves transforming a variety of downstream coding tasks into validation sets and using the model's loss on these sets as a performance metric. Next, we…
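The idea of scoring training data by the model's loss on a validation set can be illustrated with a toy sketch. The abstract does not spell out the exact scoring rule, so the version below makes assumptions: a two-parameter linear model stands in for a Code-LLM, mean squared error stands in for language-model loss, and a sample's score is the drop in validation loss after one hypothetical gradient step on that sample alone. The names `val_loss`, `influence_scores`, and `select_top_fraction` are illustrative and do not come from the paper.

```python
import numpy as np

def val_loss(w, X_val, y_val):
    """Mean squared error on the validation set (toy stand-in for LM loss)."""
    residual = X_val @ w - y_val
    return float(np.mean(residual ** 2))

def influence_scores(w, X_train, y_train, X_val, y_val, lr=0.1):
    """Score each training sample by how much one SGD step on that sample
    alone would lower the validation loss (assumed scoring rule)."""
    base = val_loss(w, X_val, y_val)
    scores = np.empty(len(X_train))
    for i, (x, y) in enumerate(zip(X_train, y_train)):
        grad = 2.0 * (x @ w - y) * x      # gradient of (x.w - y)^2 w.r.t. w
        w_probe = w - lr * grad           # probe update; w itself is not modified
        scores[i] = base - val_loss(w_probe, X_val, y_val)
    return scores

def select_top_fraction(scores, fraction):
    """Return indices of the highest-scoring samples."""
    k = max(1, int(len(scores) * fraction))
    return np.argsort(scores)[::-1][:k]

# Toy data: the "downstream task" is y = x . w_true.
w_true = np.array([1.0, -2.0])
X_val = np.eye(2)
y_val = X_val @ w_true

X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
y_train = X_train @ w_true
y_train[2:] *= -1.0                       # samples 2 and 3 get flipped (bad) labels

w0 = np.zeros(2)                          # untrained model
scores = influence_scores(w0, X_train, y_train, X_val, y_val)
kept = select_top_fraction(scores, fraction=0.5)
print("scores:", scores)
print("kept indices:", sorted(kept.tolist()))
```

In this toy setting the two samples with flipped labels receive negative scores, because a step on them raises the validation loss, and the filter keeps the two clean samples. With a real Code-LLM, the probe step and the validation sets built from downstream coding tasks would be far more expensive to evaluate, which is why the choice of validation data matters.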
