Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining

Yuxiang Wei; Hojae Han; Rajhans Samdani

arXiv:2409.02326·cs.CL·September 5, 2024

TL;DR

Arctic-SnowCoder-1.3B is a data-efficient code model trained on 555B tokens with a multi-phase approach, achieving state-of-the-art performance on challenging coding benchmarks despite limited data, emphasizing the importance of data quality and distribution alignment.

Contribution

The paper introduces Arctic-SnowCoder-1.3B, a code model trained with a multi-phase data selection and training method that improves performance while using less data than previous models.

Findings

01

Achieves state-of-the-art results on BigCodeBench

02

Outperforms similarly sized models trained on more data

03

High-quality data aligned with downstream tasks is crucial

Abstract

Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using…
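The phase budgets in the abstract and the idea of keeping only the best-scoring files can be illustrated with a small sketch. It is not the authors' pipeline: the `CodeFile` type, the `quality` score (standing in for the output of a quality annotator), the greedy `select_top_quality` helper, and the toy corpus are all hypothetical, and the real selection procedure may differ.

```python
from dataclasses import dataclass

# Phase token budgets as stated in the abstract, in billions of tokens.
PHASE_BUDGETS_B = {"general": 500, "continued": 50, "enhanced": 5}


@dataclass
class CodeFile:
    path: str
    num_tokens: int
    quality: float  # hypothetical annotator score in [0, 1]


def select_top_quality(files, token_budget):
    """Greedily keep the highest-scoring files that still fit in token_budget."""
    selected, used = [], 0
    for f in sorted(files, key=lambda f: f.quality, reverse=True):
        if used + f.num_tokens <= token_budget:
            selected.append(f)
            used += f.num_tokens
    return selected, used


# Toy corpus (made-up values) standing in for phase-one data.
corpus = [
    CodeFile("a.py", 400, 0.91),
    CodeFile("b.py", 700, 0.35),
    CodeFile("c.py", 300, 0.78),
    CodeFile("d.py", 500, 0.66),
]

chosen, used_tokens = select_top_quality(corpus, token_budget=1000)
total_budget_b = sum(PHASE_BUDGETS_B.values())

print([f.path for f in chosen], used_tokens)  # ['a.py', 'c.py'] 700
print(total_budget_b)                         # 555
```

The three budgets sum to the 555B tokens quoted in the abstract. A greedy cut by score is only one plausible way to fill a fixed token budget; a threshold on the score would be another.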

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 5·Confidence 5

Strengths

Arctic-SnowCoder demonstrates notable strengths among small models, particularly in achieving state-of-the-art results on BigCodeBench with a 36% performance improvement over Phi-1.5-1.3B, despite using only 555B tokens compared to models trained on trillions of tokens. Arctic-SnowCoder-1.3B outperforms StarCoderBase-3B across all benchmarks and surpasses StarCoder2-3B, trained on over 3.3T tokens, on HumanEval+ with a score of 28.0 compared to 27.4. The model also achieves competitive re

Weaknesses

While the synthetic data significantly boosts performance, as seen in the 36% improvement over Phi-1.5-1.3B on BigCodeBench, an overreliance on synthetic data risks skewing the model’s understanding of practical coding tasks. Additionally, the performance on HumanEval+ (28.0) and MBPP+ (42.9), although impressive, shows only incremental improvements over models like StarCoder2-3B (27.4 on HumanEval+ and 49.2 on MBPP+), indicating room for optimization in handling more complex or diverse programm

Reviewer 02·Rating 6·Confidence 3

Strengths

+ Important Area. The authors address a critical aspect of language model development, high-quality data in the code domain, which is essential for improving model performance and applicability.

+ Good Performance on BigCodeBench. Arctic-SnowCoder-1.3B demonstrates good results, achieving state-of-the-art performance on BigCodeBench and surpassing similarly sized models trained on up to 1 trillion tokens, including a notable 36% improvement over Phi-1.5-1.3B.

Weaknesses

1. Limited Novelty: While the use of a data annotator to extract high-quality data for pretraining is a valuable approach, it is not entirely novel. Similar methodologies have been employed, such as using GPT-4 as a data annotator. This raises questions about the uniqueness of the authors' contributions.

2. Missing Baselines: The evaluation would benefit from the inclusion of additional baselines, such as OpenAI's GPT models. Comparing or discussing these established models would provide a more

Reviewer 03·Rating 5·Confidence 3

Strengths

1. This paper proposes a method for improving the performance of pretrained models by focusing on multi-stage data quality enhancement. It introduces a high-performing code model with low token usage.

2. Additionally, the paper analyzes training strategies, including the preparation of training data files and the characteristics of learning rate scheduling.

Weaknesses

This paper primarily focuses on techniques for enhancing and filtering the quality of code training data, with a key emphasis on how high-quality, filtered data improves model performance. However, an important question arises: could this improvement come at a cost, such as reduced generalization ability on non-target domain tasks? Additionally, the paper should review some existing techniques for improving training data quality and, where appropriate, include comparative analyses to demonstra

Taxonomy

Topics: Computational Physics and Python Applications · Software Testing and Debugging Techniques · Scientific Computing and Data Management

Methods: Balanced Selection