Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection
Weijie Lyu, Sheng-Jun Huang, Xuan Xia

TL;DR
This paper presents a data selection method for code LLM training that improves efficiency and performance by ensuring distribution consistency and diversity, achieving better results with fewer samples.
Contribution
The proposed approach introduces a parametric data selection method that enhances training efficiency and model performance by focusing on data quality through distribution and diversity optimization.
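As a rough illustration of what "distribution consistency" and "diversity" can mean for a selected subset, the sketch below greedily picks samples whose running mean embedding stays close to the full pool's mean while rewarding distance from samples already chosen. This is a simple non-parametric stand-in, not the paper's parametric selection model; `select_subset`, `diversity_weight`, and the use of plain embedding vectors are all assumptions made for illustration.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_subset(embeddings, k, diversity_weight=1.0):
    """Greedily pick up to k indices (hypothetical helper, not the paper's method).

    Each step scores a candidate by:
      consistency: negative squared distance between the subset mean
                   (with the candidate added) and the full-pool mean
      diversity:   squared distance to the nearest already-selected sample
    """
    target = mean_vector(embeddings)
    running_sum = [0.0] * len(target)
    selected = []
    remaining = set(range(len(embeddings)))

    while len(selected) < k and remaining:
        best_idx, best_score = None, -math.inf
        n = len(selected) + 1
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            cand = embeddings[i]
            new_mean = [(s + c) / n for s, c in zip(running_sum, cand)]
            consistency = -sq_dist(new_mean, target)
            diversity = min(
                (sq_dist(cand, embeddings[j]) for j in selected), default=0.0
            )
            score = consistency + diversity_weight * diversity
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        running_sum = [s + c for s, c in zip(running_sum, embeddings[best_idx])]
        remaining.remove(best_idx)
    return selected


if __name__ == "__main__":
    random.seed(0)
    # Toy 2-D "embeddings"; real code samples would need an embedding model.
    pool = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
    print(select_subset(pool, k=10))
```

The greedy loop costs O(k · n · k) distance evaluations, so it is only meant to make the two objectives concrete on small pools; a learned, parametric selector as described in the paper would replace the hand-written score.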
Findings
Achieves gains of 2.4% on HumanEval and 2.3% on MBPP over the 92K full-sampled baseline while using only 10K samples.
Outperforms other sampling methods in both performance and efficiency.
Reduces computational cost by training on 10K selected samples instead of the full 92K, without sacrificing model quality.
Abstract
Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that, using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance…
