Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
Yuxiang Wei, Hojae Han, Rajhans Samdani

TL;DR
Arctic-SnowCoder-1.3B is a data-efficient code model trained on 555B tokens with a multi-phase approach, achieving state-of-the-art performance on challenging coding benchmarks despite limited data, emphasizing the importance of data quality and distribution alignment.
Contribution
The paper introduces Arctic-SnowCoder-1.3B, a code model trained with a novel multi-phase data selection and training method that improves performance with less data than previous models.
Findings
Achieves state-of-the-art results on BigCodeBench
Outperforms similarly sized models trained on more data
High-quality data aligned with downstream tasks is crucial
Abstract
Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using…
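The three phases in the abstract can be sketched as a token budget plus a score-based selection step. This is a minimal illustration, not the authors' pipeline: the `Phase` class, the `select_by_quality` helper, the toy file records, and the rule "rank by score, keep files until a token budget is filled" are all assumptions made for this example. Only the per-phase token counts (500B, 50B, 5B) come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    tokens_b: int  # token budget in billions, as stated in the abstract


# Phase sizes taken from the abstract; the names are shorthand.
PHASES = [
    Phase("general pretraining", 500),
    Phase("continued pretraining", 50),
    Phase("enhanced pretraining", 5),
]


def total_tokens_b(phases):
    """Sum the per-phase budgets (in billions of tokens)."""
    return sum(p.tokens_b for p in phases)


def select_by_quality(files, score_fn, token_budget):
    """Hypothetical selection rule: rank files by a quality score and
    keep each file that still fits in the remaining token budget.

    `score_fn` stands in for the BERT-style quality annotator; the real
    annotator and the real selection rule are not given in the abstract.
    """
    ranked = sorted(files, key=score_fn, reverse=True)
    selected, used = [], 0
    for f in ranked:
        if used + f["tokens"] > token_budget:
            continue  # skip files that would exceed the budget
        selected.append(f)
        used += f["tokens"]
    return selected


# Toy records with made-up scores, for illustration only.
toy_files = [
    {"path": "a.py", "tokens": 40, "score": 0.9},
    {"path": "b.py", "tokens": 70, "score": 0.8},
    {"path": "c.py", "tokens": 50, "score": 0.5},
]
picked = select_by_quality(toy_files, lambda f: f["score"], token_budget=100)
```

Under these assumptions the budgets sum to the 555B tokens quoted in the abstract, and the helper shows how a phase-two subset could be carved out of a phase-one corpus once every file has a score.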
Peer Reviews
Decision: Submitted to ICLR 2025
Arctic-SnowCoder demonstrates remarkable strengths among small models, particularly in achieving state-of-the-art results on BigCodeBench with a 36% performance improvement over Phi-1.5-1.3B, despite using only 555B tokens compared to models trained on trillions of tokens. Arctic-SnowCoder-1.3B outperforms StarCoderBase-3B across all benchmarks and surpasses StarCoder2-3B, trained on over 3.3T tokens, on HumanEval+ with a score of 28.0 compared to 27.4. The model also achieves competitive re
While the synthetic data significantly boosts performance, as seen in the 36% improvement over Phi-1.5-1.3B on BigCodeBench, an overreliance on synthetic data risks skewing the model’s understanding of practical coding tasks. Additionally, the performance on HumanEval+ (28.0) and MBPP+ (42.9), although impressive, shows only incremental improvements over models like StarCoder2-3B (27.4 on HumanEval+ and 49.2 on MBPP+), indicating room for optimization in handling more complex or diverse programm
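The scores quoted in this review can be compared directly. The short sketch below computes the absolute and relative gaps between Arctic-SnowCoder-1.3B and StarCoder2-3B using only the four numbers given in the review text; the `SCORES` dictionary and the `gap` helper are names introduced for this illustration.

```python
# Scores as quoted in the review: (Arctic-SnowCoder-1.3B, StarCoder2-3B).
SCORES = {
    "HumanEval+": (28.0, 27.4),
    "MBPP+": (42.9, 49.2),
}


def gap(ours, baseline):
    """Return (absolute gap in points, relative gap in percent),
    each rounded to one decimal place."""
    absolute = ours - baseline
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


gaps = {bench: gap(*pair) for bench, pair in SCORES.items()}
for bench, (abs_pts, rel_pct) in gaps.items():
    print(f"{bench}: {abs_pts:+.1f} points ({rel_pct:+.1f}%)")
```

On these figures the HumanEval+ gap is positive (+0.6 points) while the MBPP+ gap is negative (-6.3 points), so the two benchmarks point in opposite directions for this pair of models.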
+ Important Area. The authors address a critical aspect of language model development—high-quality data in the code domain—which is essential for improving model performance and applicability.
+ Good Performance on BigCodeBench. Arctic-SnowCoder-1.3B demonstrates good results, achieving state-of-the-art performance on BigCodeBench and surpassing similarly sized models trained on up to 1 trillion tokens, including a notable 36% improvement over Phi-1.5-1.3B.
1. Limited Novelty: While the use of a data annotator to extract high-quality data for pretraining is a valuable approach, it is not entirely novel. Similar methodologies have been employed, such as using GPT-4 as a data annotator. This raises questions about the uniqueness of the authors' contributions.
2. Missing Baselines: The evaluation would benefit from the inclusion of additional baselines, such as OpenAI's GPT models. Comparing or discussing these established models would provide a more
1. This paper proposes a method for improving the performance of pre-training models by focusing on multi-stage data quality enhancement. It introduces a high-performing code model with low token usage.
2. Additionally, the paper analyzes training strategies, including the preparation of training data files and the characteristics of learning rate scheduling.
This paper primarily focuses on techniques for enhancing and filtering the quality of code training data, with a key emphasis on how high-quality, filtered data improves model performance. However, an important question arises: could this improvement come at a cost, such as reduced generalization ability on non-target domain tasks? Additionally, the paper should review some existing techniques for improving training data quality and, where appropriate, include comparative analyses to demonstra
Taxonomy
Topics: Computational Physics and Python Applications · Software Testing and Debugging Techniques · Scientific Computing and Data Management
Methods: Balanced Selection
